data_juicer.tools.DJ_mcp_recipe_flow module#
- data_juicer.tools.DJ_mcp_recipe_flow.get_global_config_schema() dict[source]#
Get the full schema of all available global configuration options for Data-Juicer.
Returns a dictionary where each key is a config parameter name and the value is a dict containing:
- type: the expected type of the parameter (e.g. "bool", "int", "str")
- default: the default value
- description: a human-readable description of the parameter
Use this tool to discover what configuration options can be passed to run_data_recipe via the extra_config parameter. This dynamically reflects the latest Data-Juicer configuration, so it will always be up-to-date even as new config options are added.
- Returns:
A dict mapping config parameter names to their schema info
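As an illustration of the schema shape described above, here is a minimal sketch that checks an extra_config dict against such a schema. The parameter names and defaults in example_schema are made up for the sketch; call get_global_config_schema() for the real, current options.

```python
# Hypothetical illustration of the documented schema shape; these entries
# are assumptions, not the actual Data-Juicer config schema.
example_schema = {
    "np": {"type": "int", "default": 1,
           "description": "Number of worker processes."},
    "open_tracer": {"type": "bool", "default": False,
                    "description": "Enable op-level tracing."},
}

def check_extra_config(extra_config: dict, schema: dict) -> list:
    """Return a list of warnings for unknown keys or type mismatches."""
    type_map = {"bool": bool, "int": int, "str": str, "float": float}
    warnings = []
    for key, value in extra_config.items():
        if key not in schema:
            warnings.append(f"unknown config option: {key}")
            continue
        expected = type_map.get(schema[key]["type"])
        if expected is not None and not isinstance(value, expected):
            warnings.append(f"{key}: expected {schema[key]['type']}, "
                            f"got {type(value).__name__}")
    return warnings
```

A pre-flight check like this can catch typos before a long-running recipe is launched.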
- data_juicer.tools.DJ_mcp_recipe_flow.get_dataset_load_strategies() dict[source]#
Get all available dataset loading strategies supported by Data-Juicer.
Returns information about each strategy including its executor type, data type, data source, required/optional configuration fields, and description. Use this tool to understand how to configure the ‘dataset’ parameter in run_data_recipe for different data sources (e.g., local files, HuggingFace, S3, ModelScope, etc.).
The 'dataset' parameter in run_data_recipe accepts a dict with:
- configs: a list of dataset config dicts, each containing a 'type' field that maps to a data source strategy (e.g., 'local', 'huggingface')
- max_sample_num: optional maximum number of samples to load
Each dataset config dict should follow the required/optional fields described in the returned strategy information.
- Returns:
A dict mapping strategy identifiers to their configuration info
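The dict shape described above can be assembled with a small helper. This is a sketch only: the strategy fields ('type', 'path', 'split') mirror the examples on this page, and get_dataset_load_strategies() remains the authoritative source for each strategy's required/optional fields.

```python
# Sketch of building the 'dataset' argument for run_data_recipe in the
# documented shape: {"configs": [...], "max_sample_num": ...}.
def make_dataset_config(configs, max_sample_num=None):
    """Wrap per-source config dicts into a run_data_recipe 'dataset' dict."""
    for cfg in configs:
        if "type" not in cfg:
            raise ValueError("each dataset config needs a 'type' field")
    dataset = {"configs": configs}
    if max_sample_num is not None:
        dataset["max_sample_num"] = max_sample_num
    return dataset

# A HuggingFace source, capped at 10000 samples (field names assumed
# from the examples in this page).
hf_dataset = make_dataset_config(
    [{"type": "huggingface", "path": "tatsu-lab/alpaca", "split": "train"}],
    max_sample_num=10000,
)
```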
- data_juicer.tools.DJ_mcp_recipe_flow.search_ops(query: str | None = None, op_type: str | None = None, tags: List[str] | None = None, match_all: bool = True, search_mode: str = 'tags', top_k: int = 10) dict[source]#
Search for available data processing operators.
Operators are a collection of basic processes that assist in data modification, cleaning, filtering, deduplication, etc.
Supports multiple search modes:
- "tags": filter by op_type and/or tags (the default, original behavior). If both tags and op_type are None, returns all operators.
- "regex": Python regex pattern matching against OP names, descriptions, and parameters. Requires the query parameter.
- "bm25": BM25 text relevance ranking for natural language queries. Returns the top_k most relevant operators. Requires the query parameter.
op_type and tags can be combined with any search_mode as additional filters to narrow down results.
The following op_type values are supported:
- aggregator: aggregates batched samples, e.g. into a summary or conclusion.
- deduplicator: detects and removes duplicate samples.
- filter: filters out low-quality samples.
- grouper: groups samples into batched samples.
- mapper: edits and transforms samples.
- selector: selects top samples based on ranking.
- pipeline: applies dataset-level processing; both input and output are datasets.
The tags parameter specifies the characteristics of the data or the required resources. Available tags are:
- Modality Tags:
text: process text data specifically.
image: process image data specifically.
audio: process audio data specifically.
video: process video data specifically.
multimodal: process multimodal data.
- Resource Tags:
cpu: only requires CPU resource.
gpu: requires GPU/CUDA resource as well.
- Model Tags:
api: equipped with API-based models (e.g. ChatGPT, GPT-4o).
vllm: equipped with models supported by vLLM.
hf: equipped with models from HuggingFace Hub.
- Parameters:
query – Search query string. Required for “regex” and “bm25” modes. For “regex” mode, this should be a Python regex pattern. For “bm25” mode, this should be a natural language description of the desired functionality.
op_type – The type of data processing operator to filter by. If None, no type-based filtering is applied. Defaults to None.
tags – An optional list of tags to filter operators. If None, no tag-based filtering is applied. Defaults to None.
match_all – If True, only operators matching all specified tags are returned. If False, operators matching any tag are returned. Defaults to True.
search_mode – The search strategy to use. One of “tags”, “regex”, or “bm25”. Defaults to “tags”.
top_k – Maximum number of results to return for “bm25” mode. Defaults to 10. Ignored for other modes.
- Returns:
A dict containing detailed information about the matched operators, keyed by operator name.
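The tag-filtering semantics described above (including match_all) can be sketched over a made-up registry; the real tool searches Data-Juicer's actual operator pool, and the operator entries below are assumptions for illustration.

```python
# Toy operator registry; names and tags are invented for the sketch.
registry = {
    "text_length_filter": {"type": "filter", "tags": ["text", "cpu"]},
    "image_blur_mapper": {"type": "mapper", "tags": ["image", "cpu"]},
    "video_tagging_mapper": {"type": "mapper", "tags": ["video", "gpu", "hf"]},
}

def search_ops_basic(registry, op_type=None, tags=None, match_all=True):
    """Filter operators by type and tags (AND across tags if match_all,
    OR otherwise), mirroring the default search mode described above."""
    results = {}
    for name, info in registry.items():
        if op_type is not None and info["type"] != op_type:
            continue
        if tags:
            matcher = all if match_all else any
            if not matcher(t in info["tags"] for t in tags):
                continue
        results[name] = info
    return results
```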
- data_juicer.tools.DJ_mcp_recipe_flow.run_data_recipe(process: list[Dict], dataset_path: str | None = None, dataset: Dict | None = None, export_path: str | None = None, np: int = 1, extra_config: Dict | None = None) str[source]#
Run a data processing recipe using Data-Juicer operators.
If you want to run one or more Data-Juicer data processing operators, use this tool. Supported operators and their arguments should be obtained through the search_ops tool.
For advanced configuration options (e.g., enabling tracing, op fusion, checkpoint, multimodal keys, etc.), first call get_global_config_schema to discover available options, then pass them via extra_config.
For loading datasets from different sources (e.g., HuggingFace, S3), first call get_dataset_load_strategies to discover available loading strategies and their required fields, then pass the configuration via the dataset parameter.
- Parameters:
process – List of processing operations to be executed sequentially. Each element is a dictionary with operator name as key and its configuration as value.
dataset_path – Path to the dataset to be processed. This is the simplest way to specify input data (local file path).
dataset – Optional dataset configuration dict for advanced data loading. Supports multiple data sources (local, HuggingFace, S3, etc.). Format follows Data-Juicer's dataset config schema: {"configs": [{"type": "local", "path": "…"}, …], "max_sample_num": 10000}. Use get_dataset_load_strategies to discover available options. When provided alongside dataset_path, both are passed to Data-Juicer (dataset_path serves as a fallback).
export_path – Path to export the processed dataset. Defaults to None, which exports to ‘./outputs’ directory.
np – Number of processes to use. Defaults to 1.
extra_config – Optional dict of additional global configuration options. Use get_global_config_schema to discover all available options. Example: {"open_tracer": True, "trace_num": 20, "op_fusion": True, "text_keys": "instruction"}
Example
# Basic usage: filter text samples
>>> run_data_recipe(
...     process=[{"text_length_filter": {"min_len": 10, "max_len": 50}}],
...     dataset_path="/path/to/dataset.jsonl"
... )
# Advanced usage with tracing and a HuggingFace dataset
>>> run_data_recipe(
...     dataset_path="",
...     process=[{"language_id_score_filter": {"lang": "en"}}],
...     dataset={
...         "configs": [{
...             "type": "huggingface",
...             "path": "tatsu-lab/alpaca",
...             "split": "train"
...         }]
...     },
...     extra_config={
...         "open_tracer": True,
...         "trace_num": 20,
...         "text_keys": "instruction"
...     }
... )
- data_juicer.tools.DJ_mcp_recipe_flow.analyze_dataset(process: list[Dict], dataset_path: str | None = None, dataset: Dict | None = None, export_path: str | None = None, np: int = 1, percentiles: List[float] | None = None, extra_config: Dict | None = None) str[source]#
Analyze a dataset using Data-Juicer’s Analyzer pipeline.
This tool computes statistics for the specified filter and tagging operators on the dataset, then performs overall analysis, column-wise analysis, and correlation analysis. It generates stats tables and distribution figures to help understand the dataset characteristics before applying actual data processing.
This is the equivalent of the dj-analyze command. Use it to understand your dataset's quality distribution, identify outliers, and determine appropriate filter thresholds before running run_data_recipe.
Supported operators and their arguments should be obtained through the search_ops tool. Only filter-type and tagging-type operators will produce meaningful analysis results.
- Parameters:
process – List of filter/tagging operations to compute stats for. Each element is a dictionary with operator name as key and its configuration as value. Only filter and tagging operators produce analysis stats.
dataset_path – Path to the dataset to be analyzed. This is the simplest way to specify input data (local file path).
dataset – Optional dataset configuration dict for advanced data loading. Same format as in run_data_recipe.
export_path – Path to export the analyzed dataset with stats. Defaults to None, which exports to ‘./outputs’ directory.
np – Number of processes to use. Defaults to 1.
percentiles – List of percentiles to compute for the dataset distribution analysis. Defaults to [0.25, 0.5, 0.75].
extra_config – Optional dict of additional global configuration options. Use get_global_config_schema to discover all available options. Analysis-specific options include:
- export_original_dataset (bool): whether to export the original dataset with stats (default: False)
- save_stats_in_one_file (bool): whether to save all stats into one file (default: False)
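For intuition about the percentiles parameter: the default [0.25, 0.5, 0.75] corresponds to the quartile cut points of each stat column. A minimal sketch over a toy text-length column, using the standard library (statistics.quantiles with the "inclusive" method matches these linear-interpolation percentiles; the data values are invented):

```python
import statistics

# Toy per-sample text lengths standing in for one analyzed stat column.
text_lengths = [12, 35, 48, 60, 95]

# n=4 yields the three quartile cut points, i.e. the 25th, 50th, and
# 75th percentiles of the distribution.
q1, median, q3 = statistics.quantiles(text_lengths, n=4, method="inclusive")
```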
Example
# Analyze text length and language distribution
>>> analyze_dataset(
...     dataset_path="/path/to/dataset.jsonl",
...     process=[
...         {"text_length_filter": {"min_len": 10, "max_len": 1000}},
...         {"language_id_score_filter": {"lang": "en"}}
...     ],
...     percentiles=[0.1, 0.25, 0.5, 0.75, 0.9]
... )
- Returns:
A message indicating where the analysis results are saved, including the export path and the analysis directory containing stats tables and distribution figures.