data_juicer_agents.tools.dj_helpers module
- async data_juicer_agents.tools.dj_helpers.execute_safe_command(command: str, timeout: int = 300, **kwargs: Any) → ToolResponse
Execute safe commands, including DataJuicer commands and other safe system commands. Returns the return code, standard output, and standard error wrapped in <returncode></returncode>, <stdout></stdout>, and <stderr></stderr> tags.
- Parameters:
command (str) -- The command to execute. Allowed commands include:
  - DataJuicer commands: dj-process, dj-analyze
  - File system commands: mkdir, ls, pwd, cat, echo, cp, mv, rm
  - Text processing: grep, head, tail, wc, sort, uniq
  - Archive commands: tar, zip, unzip
  - Other safe commands: which, whoami, date, find
timeout (float, defaults to 300) -- The maximum time (in seconds) allowed for the command to run.
- Returns:
The tool response containing the return code, standard output, and standard error of the executed command.
- Return type:
ToolResponse
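The allowlist and the tagged output format described above can be illustrated with a small standalone sketch. It is not the module's implementation: `is_allowed` and `parse_tagged_output` are hypothetical helpers, the check looks only at the first token of the command (the real tool may validate differently, for example around pipes or chained commands), and the parser works on a plain string because the source does not show how the text is read out of a `ToolResponse`.

```python
import re
import shlex
from typing import Dict

# Command names copied from the parameter description above.
ALLOWED_COMMANDS = {
    "dj-process", "dj-analyze",
    "mkdir", "ls", "pwd", "cat", "echo", "cp", "mv", "rm",
    "grep", "head", "tail", "wc", "sort", "uniq",
    "tar", "zip", "unzip",
    "which", "whoami", "date", "find",
}


def is_allowed(command: str) -> bool:
    """Hypothetical pre-check: is the first token an allowed command name?"""
    try:
        tokens = shlex.split(command)
    except ValueError:  # raised by shlex on input such as unbalanced quotes
        return False
    return bool(tokens) and tokens[0] in ALLOWED_COMMANDS


def parse_tagged_output(text: str) -> Dict[str, str]:
    """Extract the contents of the three documented tags from a string."""
    fields = {}
    for tag in ("returncode", "stdout", "stderr"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        fields[tag] = match.group(1) if match else ""
    return fields


# A made-up response string in the documented tag format.
sample = "<returncode>0</returncode><stdout>hello\n</stdout><stderr></stderr>"
parsed = parse_tagged_output(sample)
print(is_allowed("ls -la"), is_allowed("curl http://example.com"), parsed["returncode"])
```

Checking a command locally before passing it to the tool avoids a round trip for requests that would be rejected, but the tool's own validation remains the authority.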
- async data_juicer_agents.tools.dj_helpers.get_ops_signature(op_names: List[str]) → ToolResponse
Get detailed information for specified DataJuicer operators.
This tool retrieves comprehensive operator metadata including signatures, parameter descriptions, and usage information. It's designed to help the data processing agent generate accurate YAML configuration files.
- Parameters:
op_names (List[str]) -- List of operator names to query (e.g., ['text_length_filter', 'image_shape_filter'])
- Returns:
Detailed operator information, including:
  - Operator type (Filter/Mapper/Deduplicator)
  - Function signature with parameter types
  - Parameter descriptions and default values
  - Usage examples or constraints
- Return type:
ToolResponse
Example
>>> get_ops_signature(['text_length_filter', 'image_face_count_filter'])
Returns detailed configuration info for both operators
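Because the function is declared `async`, a bare call like the one in the example produces a coroutine that must be awaited before any result is available. The sketch below shows the calling pattern with a stand-in coroutine that has the documented signature, so it runs without the package installed. The stand-in's return value (a plain dict) is made up for illustration; the real tool returns a `ToolResponse`.

```python
import asyncio
from typing import Dict, List


# Stand-in with the documented signature so the sketch runs on its own.
# The real tool returns a ToolResponse; this return shape is hypothetical.
async def get_ops_signature(op_names: List[str]) -> Dict[str, str]:
    return {name: f"<signature of {name}>" for name in op_names}


async def main() -> Dict[str, str]:
    # Awaiting the call yields the result; without `await` only a
    # coroutine object is created and nothing runs.
    return await get_ops_signature(
        ["text_length_filter", "image_face_count_filter"]
    )


result = asyncio.run(main())
print(sorted(result))
```

With the package installed, the stand-in would presumably be replaced by importing `get_ops_signature` from `data_juicer_agents.tools.dj_helpers`, and the same `await` pattern would apply.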
- async data_juicer_agents.tools.dj_helpers.get_advanced_config_info(config_type: str = 'all') → ToolResponse
Get advanced DataJuicer configuration information from local installation.
This tool retrieves advanced configuration options that go beyond the basic YAML template, enabling more sophisticated data processing scenarios.
- Parameters:
config_type (str) -- Type of configuration to retrieve. Defaults to "all". Options:
  - "global": Get additional global parameters from config_all.yaml
  - "dataset": Get flexible dataset configuration from DatasetCfg.md
  - "all": Get both global parameters and dataset configuration
- Returns:
Advanced configuration information formatted as Markdown, including parameter descriptions, default values, and usage examples.
- Return type:
ToolResponse
Note
This tool requires DATA_JUICER_PATH to be configured. If not configured, it will return an error message prompting the user to set up the path through the router agent.
Examples
>>> get_advanced_config_info(config_type="global")
Returns global parameters from config_all.yaml
>>> get_advanced_config_info(config_type="dataset")
Returns dataset configuration documentation
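The three documented `config_type` options can be summarized as a small lookup. This is only an illustration of the option-to-source mapping stated in the parameter description; `SOURCES` and `sources_for` are hypothetical names, and the sketch does not read any files or reflect how the tool resolves DATA_JUICER_PATH.

```python
from typing import List

# Source files named in the parameter description above.
SOURCES = {
    "global": ["config_all.yaml"],
    "dataset": ["DatasetCfg.md"],
}


def sources_for(config_type: str = "all") -> List[str]:
    """Hypothetical helper: list the files each documented option draws on."""
    if config_type == "all":
        # "all" combines the global parameters and the dataset documentation.
        return SOURCES["global"] + SOURCES["dataset"]
    if config_type not in SOURCES:
        raise ValueError(
            f"config_type must be 'global', 'dataset', or 'all', got {config_type!r}"
        )
    return SOURCES[config_type]


print(sources_for("global"))
print(sources_for())
```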