data_juicer_agents.tools

Tools package for data-agent.

This module provides a unified entry point for all agent tools, organized by agent type for easy access and management.

data_juicer_agents.tools.agents2toolkit(agents: List[AgentBase])

async data_juicer_agents.tools.get_mcp_toolkit(config_path: str | None = None) → Toolkit

Get a toolkit with all MCP tools registered.

async data_juicer_agents.tools.execute_safe_command(command: str, timeout: int = 300, **kwargs: Any) → ToolResponse

Execute a safe command, either a DataJuicer command or one of the allowed system commands. The response wraps the return code, standard output, and standard error in <returncode></returncode>, <stdout></stdout>, and <stderr></stderr> tags, respectively.

Parameters:
  • command (str) -- The command to execute. Allowed commands include:
      - DataJuicer commands: dj-process, dj-analyze
      - File system commands: mkdir, ls, pwd, cat, echo, cp, mv, rm
      - Text processing: grep, head, tail, wc, sort, uniq
      - Archive commands: tar, zip, unzip
      - Other safe commands: which, whoami, date, find

  • timeout (int, defaults to 300) -- The maximum time (in seconds) allowed for the command to run.

Returns:

The tool response containing the return code, standard output, and standard error of the executed command.

Return type:

ToolResponse
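
Because the output is wrapped in tags, a caller may want to split it back into separate fields. The following is a minimal sketch of that step, assuming the tagged text has already been extracted from the ToolResponse as a plain string. The helper parse_command_output is hypothetical and not part of data_juicer_agents.

```python
import re

# Hypothetical helper, not part of data_juicer_agents: pulls the three tagged
# fields out of the text produced by execute_safe_command.
_TAG_PATTERN = re.compile(r"<(returncode|stdout|stderr)>(.*?)</\1>", re.DOTALL)


def parse_command_output(text: str) -> dict:
    """Return a dict with 'returncode' (int or None), 'stdout', and 'stderr'."""
    fields = {name: body for name, body in _TAG_PATTERN.findall(text)}
    try:
        returncode = int(fields.get("returncode", "").strip())
    except ValueError:
        # The tag is missing or does not hold an integer.
        returncode = None
    return {
        "returncode": returncode,
        "stdout": fields.get("stdout", ""),
        "stderr": fields.get("stderr", ""),
    }


# Sample text in the documented tag layout (the content is made up).
sample = "<returncode>0</returncode><stdout>hello\n</stdout><stderr></stderr>"
result = parse_command_output(sample)
print(result["returncode"], repr(result["stdout"]))
```

A missing or non-numeric return code is reported as None rather than raising, so the caller can decide how to treat malformed output.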

async data_juicer_agents.tools.view_text_file(file_path: str, ranges: list[int] | None = None) → ToolResponse

View the file content in the specified range with line numbers. If ranges is not provided, the entire file will be returned.

Parameters:
  • file_path (str) -- The target file path.

  • ranges (list[int] | None, defaults to None) -- The inclusive range of lines to view, e.g. [1, 100] for lines 1 to 100. Negative values count from the end of the file, so [-100, -1] selects the last 100 lines. If not provided, the entire file is returned.

Returns:

The tool response containing the file content or an error message.

Return type:

ToolResponse
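
The range semantics described above (1-based, inclusive, with negative values counting from the end) can be illustrated with a small stand-alone sketch. The function select_lines and its "N: text" line-number format are assumptions made for illustration; the tool's actual implementation and output layout may differ.

```python
from typing import List, Optional


def select_lines(text: str, ranges: Optional[List[int]] = None) -> str:
    """Return the requested lines of `text`, each prefixed with its line number."""
    lines = text.splitlines()
    total = len(lines)
    if ranges is None:
        start, end = 1, total
    else:
        start, end = ranges
        # Negative values count from the end: -1 is the last line.
        if start < 0:
            start = total + start + 1
        if end < 0:
            end = total + end + 1
    # Clamp to the file so [-100, -1] also works on files shorter than 100 lines.
    start = max(start, 1)
    end = min(end, total)
    return "\n".join(f"{n}: {lines[n - 1]}" for n in range(start, end + 1))


text = "a\nb\nc\nd\ne"
print(select_lines(text, [2, 3]))    # lines 2 and 3
print(select_lines(text, [-2, -1]))  # the last two lines
```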

async data_juicer_agents.tools.write_text_file(file_path: str, content: str, ranges: None | list[int] = None) → ToolResponse

Create a text file, or replace or overwrite its content. When ranges is provided, the content replaces the lines in the specified range. Otherwise, the entire file (if it exists) is overwritten.

Parameters:
  • file_path (str) -- The target file path.

  • content (str) -- The content to be written.

  • ranges (list[int] | None, defaults to None) -- The range of lines to be replaced. If None, the entire file will be overwritten.

Returns:

The tool response containing the result of the writing operation.

Return type:

ToolResponse
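
The difference between a ranged replacement and a full overwrite can be sketched on in-memory strings. The helper replace_lines is hypothetical, and it assumes that ranges is 1-based and inclusive in the same way as for view_text_file, which this page does not state explicitly for writes.

```python
from typing import List, Optional


def replace_lines(original: str, content: str, ranges: Optional[List[int]] = None) -> str:
    """Return the new file text after writing `content` into `original`."""
    if ranges is None:
        # No range given: the whole file is overwritten.
        return content
    start, end = ranges
    lines = original.splitlines()
    # Assumed semantics: lines start..end (1-based, inclusive) are replaced.
    lines[start - 1:end] = content.splitlines()
    return "\n".join(lines)


original = "line 1\nline 2\nline 3\nline 4"
updated = replace_lines(original, "new 2\nnew 2b", [2, 3])
print(updated)
```

The replacement does not need to have the same number of lines as the range it replaces; the lines after the range simply shift.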

data_juicer_agents.tools.agent_to_tool(agent: AgentBase, tool_name: str = None, description: str = None) → Callable

Convert any agent to a tool function that can be registered in toolkit.

Parameters:
  • agent -- The agent instance to convert.

  • tool_name -- Optional custom tool name (defaults to agent.name).

  • description -- Optional tool description (defaults to the agent's docstring or sys_prompt).

Returns:

A tool function that can be registered with toolkit.register_tool_function()
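
The general idea of wrapping an agent in a plain function, with the name and description defaults listed above, can be sketched as follows. This is not the library's implementation: EchoAgent, its reply method, and wrap_agent are hypothetical stand-ins, and the real tool function may well be asynchronous.

```python
from typing import Callable, Optional


class EchoAgent:
    # Hypothetical stand-in for an AgentBase subclass. It has no docstring,
    # so the wrapper below falls back to sys_prompt for the description.
    def __init__(self, name: str, sys_prompt: str):
        self.name = name
        self.sys_prompt = sys_prompt

    def reply(self, task: str) -> str:
        return f"[{self.name}] {task}"


def wrap_agent(
    agent: EchoAgent,
    tool_name: Optional[str] = None,
    description: Optional[str] = None,
) -> Callable[[str], str]:
    """Return a function that forwards a task to the agent."""

    def tool(task: str) -> str:
        return agent.reply(task)

    # Defaults mirror the documented behaviour: the agent's name, then its
    # docstring or system prompt.
    tool.__name__ = tool_name or agent.name
    tool.__doc__ = description or agent.__doc__ or agent.sys_prompt
    return tool


router_tool = wrap_agent(EchoAgent("router", "Routes tasks to sub-agents."))
print(router_tool.__name__, "|", router_tool.__doc__)
print(router_tool("clean the dataset"))
```

Setting the function's name and docstring matters because toolkits commonly derive the tool schema shown to the model from those two attributes.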

async data_juicer_agents.tools.query_dj_operators(query: str, limit: int = 20) → ToolResponse

Query DataJuicer operators by natural language description.

Retrieves relevant operators from DataJuicer library based on user query. Supports matching by functionality, data type, and processing scenarios.

Parameters:
  • query (str) -- Natural language operator query

  • limit (int) -- Maximum number of operators to return (default: 20)

Returns:

Tool response containing matched operators with names, descriptions, and parameters

Return type:

ToolResponse

async data_juicer_agents.tools.get_ops_signature(op_names: List[str]) → ToolResponse

Get detailed information for specified DataJuicer operators.

This tool retrieves comprehensive operator metadata including signatures, parameter descriptions, and usage information. It's designed to help the data processing agent generate accurate YAML configuration files.

Parameters:

op_names (List[str]) -- List of operator names to query (e.g., ['text_length_filter', 'image_shape_filter'])

Returns:

Detailed operator information including:
  • Operator type (Filter/Mapper/Deduplicator)

  • Function signature with parameter types

  • Parameter descriptions and default values

  • Usage examples or constraints

Return type:

ToolResponse

Examples

>>> get_ops_signature(['text_length_filter', 'image_face_count_filter'])
Returns detailed configuration info for both operators
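
To make "function signature with parameter types" and "default values" concrete, the sketch below collects that kind of metadata with the standard library's inspect module. TextLengthFilter here is a hypothetical toy class with made-up defaults, not the real DataJuicer operator, and describe_parameters is not how get_ops_signature is necessarily implemented.

```python
import inspect


class TextLengthFilter:
    """Hypothetical operator: keep samples whose text length is within a range."""

    def __init__(self, min_len: int = 10, max_len: int = 10000):
        self.min_len = min_len
        self.max_len = max_len


def describe_parameters(op_class: type) -> dict:
    """Map each constructor parameter (except self) to its type name and default."""
    signature = inspect.signature(op_class.__init__)
    return {
        name: {
            "type": getattr(param.annotation, "__name__", str(param.annotation)),
            "default": param.default,
        }
        for name, param in signature.parameters.items()
        if name != "self"
    }


info = describe_parameters(TextLengthFilter)
for name, meta in info.items():
    print(f"{name}: {meta['type']} = {meta['default']}")
```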

data_juicer_agents.tools.get_basic_files() → ToolResponse

Get basic DataJuicer development files content.

Returns the content of the essential files needed for DJ operator development:
  • base_op.py: Base operator class
  • DeveloperGuide.md: English developer guide
  • DeveloperGuide_ZH.md: Chinese developer guide

Returns:

Combined content of all basic development files

Return type:

ToolResponse

async data_juicer_agents.tools.get_operator_example(operator_names: list) → ToolResponse

Get example operators based on a list of operator names.

Parameters:

operator_names (list) -- List of operator names to get examples for

Returns:

Example operator code and test files for the specified operators

Return type:

ToolResponse

data_juicer_agents.tools.configure_data_juicer_path(data_juicer_path: str) → ToolResponse

Configure the DataJuicer path. Use this tool whenever the user provides a data_juicer_path.

Parameters:

data_juicer_path (str) -- Path to DataJuicer installation

Returns:

Configuration result

Return type:

ToolResponse

async data_juicer_agents.tools.get_advanced_config_info(config_type: str = 'all') → ToolResponse

Get advanced DataJuicer configuration information from local installation.

This tool retrieves advanced configuration options that go beyond the basic YAML template, enabling more sophisticated data processing scenarios.

Parameters:

config_type (str) -- Type of configuration to retrieve. Defaults to "all". Options:
  • "global": Get additional global parameters from config_all.yaml
  • "dataset": Get flexible dataset configuration from DatasetCfg.md
  • "all": Get both global parameters and dataset configuration

Returns:

Advanced configuration information formatted as Markdown, including parameter descriptions, default values, and usage examples.

Return type:

ToolResponse

Notes

This tool requires DATA_JUICER_PATH to be configured. If not configured, it will return an error message prompting the user to set up the path through the router agent.

Examples

>>> get_advanced_config_info(config_type="global")
Returns global parameters from config_all.yaml
>>> get_advanced_config_info(config_type="dataset")
Returns dataset configuration documentation
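
The three config_type options map onto two source files, which can be summarised in a small lookup. This is only an illustrative sketch: sources_for is a hypothetical helper, and raising ValueError on an unknown option is an assumption, since the tool's behaviour for invalid values is not described here.

```python
from typing import List

# File names are taken from the option descriptions; how the tool locates and
# reads them inside the DataJuicer installation is not covered by this sketch.
_SOURCES = {
    "global": ["config_all.yaml"],
    "dataset": ["DatasetCfg.md"],
}


def sources_for(config_type: str = "all") -> List[str]:
    """Return the source files consulted for a given config_type."""
    if config_type == "all":
        return _SOURCES["global"] + _SOURCES["dataset"]
    if config_type not in _SOURCES:
        # Assumed behaviour for unknown options.
        raise ValueError(f"Unknown config_type: {config_type!r}")
    return list(_SOURCES[config_type])


for option in ("global", "dataset", "all"):
    print(option, "->", sources_for(option))
```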