data_juicer_agents.tools.dj_helpers module

async data_juicer_agents.tools.dj_helpers.execute_safe_command(command: str, timeout: int = 300, **kwargs: Any) → ToolResponse

Execute a safe command: either a DataJuicer command or another allowed system command. The return code, standard output, and standard error are returned within <returncode></returncode>, <stdout></stdout>, and <stderr></stderr> tags.

Parameters:
  • command (str) – The command to execute. Allowed commands include:
      - DataJuicer commands: dj-process, dj-analyze
      - File system commands: mkdir, ls, pwd, cat, echo, cp, mv, rm
      - Text processing: grep, head, tail, wc, sort, uniq
      - Archive commands: tar, zip, unzip
      - Other safe commands: which, whoami, date, find

  • timeout (int, defaults to 300) – The maximum time (in seconds) allowed for the command to run.
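
The allowlist described for the command parameter can be illustrated with a small pre-check. This is only a sketch under stated assumptions: the set below is copied from the list above, while the name is_allowed and the first-token rule are hypothetical. The library's own validation is not shown in this page and may well differ (for example in how it treats pipes or chained commands).

```python
import shlex

# Command names copied from the documented allowlist above.
ALLOWED_COMMANDS = {
    "dj-process", "dj-analyze",
    "mkdir", "ls", "pwd", "cat", "echo", "cp", "mv", "rm",
    "grep", "head", "tail", "wc", "sort", "uniq",
    "tar", "zip", "unzip",
    "which", "whoami", "date", "find",
}


def is_allowed(command: str) -> bool:
    """Hypothetical pre-check: accept a command only if its first token is allowlisted.

    This inspects the first token only, so it does not reason about shell
    operators such as '&&' or '|'; a real validator would need to.
    """
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar parse errors are rejected outright.
        return False
    return bool(tokens) and tokens[0] in ALLOWED_COMMANDS


print(is_allowed("dj-process --config demo.yaml"))
print(is_allowed("curl http://example.com"))
```

A caller could run such a check before invoking the tool to fail fast on obviously unsupported commands.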

Returns:

The tool response containing the return code, standard output, and standard error of the executed command.

Return type:

ToolResponse
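
Because the result is described as text wrapped in <returncode>, <stdout>, and <stderr> tags, a caller may want to split it back into fields. The following is a minimal sketch, assuming the tagged text has already been extracted from the ToolResponse as a plain string; how to extract it is not documented here, and parse_command_output is a hypothetical helper, not part of data_juicer_agents.

```python
import re
from typing import Dict

# Matches any of the three documented tags and captures the tag name and body.
_TAG_PATTERN = re.compile(r"<(returncode|stdout|stderr)>(.*?)</\1>", re.DOTALL)


def parse_command_output(text: str) -> Dict[str, str]:
    """Hypothetical helper: split tagged tool output into its three fields.

    Tags that are absent from the text map to an empty string.
    """
    fields = {"returncode": "", "stdout": "", "stderr": ""}
    for name, value in _TAG_PATTERN.findall(text):
        fields[name] = value.strip()
    return fields


# Made-up sample text in the documented tag format.
sample = "<returncode>0</returncode><stdout>/home/user\n</stdout><stderr></stderr>"
parsed = parse_command_output(sample)
print(parsed["returncode"], repr(parsed["stdout"]))
```

Since execute_safe_command is declared async, the call itself would need to be awaited inside a coroutine before any such parsing takes place.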

async data_juicer_agents.tools.dj_helpers.get_ops_signature(op_names: List[str]) → ToolResponse

Get detailed information for specified DataJuicer operators.

This tool retrieves comprehensive operator metadata including signatures, parameter descriptions, and usage information. It’s designed to help the data processing agent generate accurate YAML configuration files.

Parameters:

op_names (List[str]) – List of operator names to query (e.g., ['text_length_filter', 'image_shape_filter'])

Returns:

Detailed operator information including:
  • Operator type (Filter/Mapper/Deduplicator)

  • Function signature with parameter types

  • Parameter descriptions and default values

  • Usage examples or constraints

Return type:

ToolResponse

Example

>>> # Returns detailed configuration info for both operators
>>> await get_ops_signature(['text_length_filter', 'image_face_count_filter'])
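
To make "function signature with parameter types" and "default values" concrete, here is a rough sketch of how such metadata can be collected with Python's inspect module. It is not the tool's implementation: TextLengthFilter below is a hypothetical stand-in class with made-up parameters, and describe_operator is an illustrative name.

```python
import inspect
from typing import Any, Dict


class TextLengthFilter:
    """Hypothetical stand-in for an operator class; parameters are made up."""

    def __init__(self, min_len: int = 10, max_len: int = 10000):
        self.min_len = min_len
        self.max_len = max_len


def describe_operator(cls: type) -> Dict[str, Any]:
    """Collect the constructor signature, parameter types, and defaults of a class."""
    sig = inspect.signature(cls)  # constructor signature, without 'self'
    params = {}
    for name, param in sig.parameters.items():
        params[name] = {
            # Use the annotation's short name when it has one (e.g. 'int').
            "type": getattr(param.annotation, "__name__", str(param.annotation)),
            "default": None if param.default is inspect.Parameter.empty else param.default,
        }
    return {"name": cls.__name__, "signature": str(sig), "parameters": params}


info = describe_operator(TextLengthFilter)
print(info["signature"])
print(info["parameters"])
```

Metadata of this shape is the kind of information an agent needs when filling in operator arguments in a YAML configuration.
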

async data_juicer_agents.tools.dj_helpers.get_advanced_config_info(config_type: str = 'all') → ToolResponse

Get advanced DataJuicer configuration information from local installation.

This tool retrieves advanced configuration options that go beyond the basic YAML template, enabling more sophisticated data processing scenarios.

Parameters:

config_type (str) – Type of configuration to retrieve. Defaults to "all". Options:
  - "global": Get additional global parameters from config_all.yaml
  - "dataset": Get flexible dataset configuration from DatasetCfg.md
  - "all": Get both global parameters and dataset configuration

Returns:

Advanced configuration information formatted as Markdown, including parameter descriptions, default values, and usage examples.

Return type:

ToolResponse

Note

This tool requires DATA_JUICER_PATH to be configured. If not configured, it will return an error message prompting the user to set up the path through the router agent.

Example

>>> # Returns global parameters from config_all.yaml
>>> await get_advanced_config_info(config_type="global")
>>> # Returns dataset configuration documentation
>>> await get_advanced_config_info(config_type="dataset")
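
The three config_type options can be pictured as a simple dispatch over source files. The sketch below is illustrative only: the file names config_all.yaml and DatasetCfg.md come from the parameter description, whereas the mapping, the function name select_sources, and the error behavior for unknown values are assumptions rather than the library's actual logic.

```python
from typing import Dict, List

# File names are taken from the documented options; the mapping is hypothetical.
CONFIG_SOURCES: Dict[str, List[str]] = {
    "global": ["config_all.yaml"],
    "dataset": ["DatasetCfg.md"],
}
# "all" is documented as covering both global parameters and dataset configuration.
CONFIG_SOURCES["all"] = CONFIG_SOURCES["global"] + CONFIG_SOURCES["dataset"]


def select_sources(config_type: str = "all") -> List[str]:
    """Hypothetical helper: map a config_type value to the files it would draw from."""
    if config_type not in CONFIG_SOURCES:
        valid = ", ".join(sorted(CONFIG_SOURCES))
        raise ValueError(f"Unknown config_type {config_type!r}; expected one of: {valid}")
    return list(CONFIG_SOURCES[config_type])


print(select_sources())
print(select_sources("global"))
```

Rejecting unknown values early is one possible design; the real tool may instead report the problem inside its ToolResponse.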