Advanced Features#
Operator Retrieval#
Operator retrieval is the core of whether the Agent can work accurately. DJ Agent implements an intelligent operator retrieval tool that quickly finds the most relevant operators from Data-Juicer’s nearly 200 operators through an independent LLM query process. This is a key component enabling the data processing agent and code development agent to run accurately.
We don’t use a single solution, but provide three modes that can be flexibly selected via the -r parameter:
Retrieval Modes#
LLM Retrieval (default)
Uses Qwen-Turbo to understand user requirements from a semantic level, suitable for complex and vague descriptions
Provides detailed matching reasons and relevance scores
Higher token consumption, but highest matching accuracy
Vector Retrieval (vector)
Based on DashScope text embedding + FAISS similarity search
Fast, suitable for batch tasks or rapid prototyping
No need to call LLM, lower cost
Auto Mode (auto)
Prioritizes LLM retrieval, automatically falls back to vector retrieval on failure
Usage#
Specify the retrieval mode using the -r or --retrieval-mode parameter:
dj-agents --retrieval-mode vector
For more parameter descriptions, see dj-agents --help
MCP Agent#
In addition to command-line tools, DataJuicer also natively supports MCP services, which is an important means to improve performance. MCP services can directly obtain operator information and execute data processing through native interfaces, making it easy to migrate and integrate without separate LLM queries and command-line calls.
MCP Server Types#
Data-Juicer provides two types of MCP:
Recipe-Flow MCP (Data Recipe)
Provides two tools:
get_data_processing_opsandrun_data_recipeRetrieves by operator type, applicable modalities, and other tags, no need to call LLM or vector models
Suitable for standardized, high-frequency scenarios with better performance
Granular-Operators MCP (Fine-grained Operators)
Wraps each built-in operator as an independent tool, runs on call
Returns all operators by default, but can control visible scope through environment variables
Suitable for fine-grained control, building fully customized data processing pipelines
This means that in some scenarios, the Agent’s call path can be shorter, faster, and more direct than manually writing YAML.
For detailed information, please refer to: Data-Juicer MCP Service Documentation
Note: The Data-Juicer MCP server is currently in early development, and features and tools may change with ongoing development.
Configuration#
Configure the service address in configs/mcp_config.json:
{
"mcpServers": {
"DJ_recipe_flow": {
"url": "http://127.0.0.1:8080/sse"
}
}
}
Usage Methods#
Enable MCP Agent to replace Data Processing Agent:
# Enable MCP Agent and Dev Agent
dj-agents --agents dj_mcp dj_dev
# Or use shorthand
dj-agents -a dj_mcp dj_dev