DataJuicer Agents#
A multi-agent data processing system built on AgentScope and Data-Juicer (DJ). This project demonstrates how to leverage the natural language understanding capabilities of large language models, enabling non-expert users to easily harness the powerful data processing capabilities of Data-Juicer.
🎯 Why DataJuicer Agents?#
In the actual work of large model R&D and applications, data processing remains a high-cost, low-efficiency, and hard-to-reproduce process. Many teams spend more time on data analysis, cleaning and synthesis than on model training, requirement alignment and app development.
We hope to free developers from tedious script assembly through agent technology, bringing data R&D closer to a "think and get" experience.
Data directly defines the upper limit of model capabilities. Model performance is determined by several dimensions of the data, such as quality, diversity, harmfulness control, and task matching. Optimizing data is essentially optimizing the model itself, and doing this efficiently requires a systematic toolset.
DataJuicer Agents is an intelligent collaboration system designed to support this new paradigm of data-model co-optimization.
What Does This Agent Do?#
Data-Juicer (DJ) is an open-source processing system covering the full lifecycle of large model data, providing four core capabilities:
Full-Stack Operator Library (DJ-OP): Nearly 200 high-performance, reusable multimodal operators covering text, images, and audio/video
High-Performance Engine (DJ-Core): Built on Ray, supporting TB-level data, 10K-core distributed computing, with operator fusion and multi-granularity fault tolerance
Collaborative Development Platform (DJ-Sandbox): Introduces A/B Test and Scaling Law concepts, using small-scale experiments to drive large-scale optimization
Natural Language Interaction Layer (DJ-Agents): Enables developers to build data pipelines through conversational interfaces using Agent technology
DataJuicer Agents is not a simple Q&A bot, but an intelligent collaborator for data processing. Specifically, it can:
Intelligent Query: Automatically match the most suitable operators to a natural language description (pinpointing them among nearly 200 operators)
Automated Pipeline: Describe your data processing needs, and the agent automatically generates Data-Juicer YAML configurations and executes them
Custom Extension: Help users develop custom operators and seamlessly integrate them into local environments
Our goal: Let developers focus on "what to do" rather than "how to do it".
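The Intelligent Query capability can be pictured as a retrieval step over operator descriptions. The following is a minimal sketch, not the project's actual retrieval code: the three-entry catalog, its paraphrased descriptions, and the functions `tokenize` and `retrieve_operators` are illustrative assumptions, and plain keyword overlap stands in for whatever LLM-based or vector retrieval the system really uses.

```python
import re

# Hypothetical mini-catalog with paraphrased descriptions.
# The real operator library contains nearly 200 operators.
OPERATOR_CATALOG = {
    "text_length_filter": "keep samples whose text length is within a range",
    "language_id_score_filter": "keep samples in a given language with high confidence",
    "document_deduplicator": "remove exact duplicate documents using hashing",
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve_operators(query, catalog, top_k=2):
    """Rank operators by the fraction of query tokens they cover."""
    query_tokens = tokenize(query)
    scored = []
    for name, description in catalog.items():
        op_tokens = tokenize(name + " " + description)
        overlap = len(query_tokens & op_tokens)
        score = overlap / len(query_tokens) if query_tokens else 0.0
        scored.append((name, score))
    # Highest score first; ties keep catalog order because the sort is stable.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    for name, score in retrieve_operators("remove duplicate documents", OPERATOR_CATALOG):
        print(f"{name}: {score:.2f}")
```

A production retriever would more plausibly embed the query and the operator documentation, but the interface (query in, ranked operator names with scores out) is the part this sketch is meant to convey.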
Architecture#
Multi-Agent Routing Architecture#
DataJuicer Agents adopts a multi-agent routing architecture, which is key to system scalability. When a user inputs a natural language request, the Router Agent first performs task triage to determine whether it's a standard data processing task or a custom requirement that needs new capabilities.
User Query
  ↓
Router Agent (Filtering & Decision) ← query_dj_operators (operator retrieval)
  │
  ├─ High-match operator found
  │   ↓
  │   DJ Agent (Standard Data Processing Task)
  │   ├── Preview data samples (confirm field names and data formats)
  │   ├── get_ops_signature (retrieve full parameter signatures)
  │   ├── Generate YAML configuration
  │   └── execute_safe_command (run dj-process, dj-analyze)
  │
  └─ No high-match operator found
      ↓
      Dev Agent (Custom Operator Development & Integration)
      ├── get_basic_files (retrieve base classes and registration mechanism)
      ├── get_operator_example (retrieve similar operator examples)
      ├── Generate compliant operator code
      └── Local integration (register to user-specified path)
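The triage step in the diagram above can be sketched as a small decision function. This is only an illustration under stated assumptions: in the real system the Router Agent is an LLM-driven agent, whereas here a fixed score threshold stands in for its judgment. The names `MATCH_THRESHOLD`, `RoutingDecision`, and `route`, and the value 0.6, are hypothetical.

```python
from dataclasses import dataclass, field

# Assumed cut-off for what counts as a "high-match" operator.
MATCH_THRESHOLD = 0.6


@dataclass
class RoutingDecision:
    agent: str                                   # "dj_agent" or "dev_agent"
    operators: list = field(default_factory=list)


def route(candidates, threshold=MATCH_THRESHOLD):
    """Pick the downstream agent from (operator_name, score) candidates.

    The candidates are assumed to come from an operator retrieval step
    such as query_dj_operators.
    """
    matched = [name for name, score in candidates if score >= threshold]
    if matched:
        # Existing operators cover the request: standard processing task.
        return RoutingDecision(agent="dj_agent", operators=matched)
    # Nothing matches well enough: hand off to custom operator development.
    return RoutingDecision(agent="dev_agent")


if __name__ == "__main__":
    print(route([("text_length_filter", 0.9), ("document_deduplicator", 0.2)]))
    print(route([("text_length_filter", 0.1)]))
```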
Two Integration Modes#
Agent integration with DataJuicer has two modes to adapt to different usage scenarios:
Tool Binding Mode: The Agent calls DataJuicer command-line tools (such as dj-analyze and dj-process); compatible with existing user habits, with low migration cost
MCP Binding Mode: The Agent directly calls DataJuicer's MCP (Model Context Protocol) interface and runs operators or data recipes directly, without generating intermediate YAML files, for better performance
These two modes are automatically selected by the Agent based on task complexity and performance requirements, ensuring both flexibility and efficiency.
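To make Tool Binding Mode concrete, the sketch below renders a small recipe as YAML text and builds the matching command line. It is a hedged illustration, not the agent's implementation: `render_recipe`, `build_command`, and `ALLOWED_TOOLS` are hypothetical names, the allowlist is only a guess at what a tool like execute_safe_command might enforce, and the recipe fields follow the layout of Data-Juicer's public demo configs, which should be verified against the version in use.

```python
import json
import shlex

# Assumption: only these Data-Juicer CLI tools may be launched by the agent.
ALLOWED_TOOLS = {"dj-process", "dj-analyze"}


def render_recipe(dataset_path, export_path, ops, num_proc=4):
    """Render a minimal recipe as YAML text without external dependencies.

    `ops` is a list of (operator_name, params_dict) pairs. json.dumps is used
    for scalars because JSON strings and numbers are also valid YAML scalars.
    """
    lines = [
        f"project_name: {json.dumps('agent-generated-recipe')}",
        f"dataset_path: {json.dumps(dataset_path)}",
        f"export_path: {json.dumps(export_path)}",
        f"np: {num_proc}",
        "process:",
    ]
    for name, params in ops:
        lines.append(f"  - {name}:")
        for key, value in params.items():
            lines.append(f"      {key}: {json.dumps(value)}")
    return "\n".join(lines) + "\n"


def build_command(tool, config_path):
    """Return an argv list for an allowed tool, or raise ValueError."""
    if tool not in ALLOWED_TOOLS:
        raise ValueError(f"tool not allowed: {tool}")
    return [tool, "--config", config_path]


if __name__ == "__main__":
    recipe = render_recipe(
        "data/input.jsonl",
        "outputs/processed.jsonl",
        [("text_length_filter", {"min_len": 10, "max_len": 2000})],
    )
    print(recipe)
    # The command is only printed here; pass the list to subprocess.run to execute it.
    print(shlex.join(build_command("dj-process", "recipe.yaml")))
```

Returning an argv list rather than a shell string avoids shell interpolation of user-supplied paths. In MCP Binding Mode this intermediate file and command would not be needed.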
Roadmap#
The Data-Juicer agent ecosystem is rapidly expanding. Here are the new agents currently in development or planned:
Data-Juicer Q&A Agent (Demo Available)#
Provides users with detailed answers about Data-Juicer operators, concepts, and best practices.
The Q&A agent can currently be viewed and tried out here.
Interactive Data Analysis and Visualization Agent (In Development)#
We are building a more advanced human-machine collaborative data optimization workflow that introduces human feedback:
Users can view statistics, attribution analysis, and visualization results
Dynamically edit recipes, approve or reject suggestions
Underpinned by dj.analyzer (data analysis), dj.attributor (effect attribution), and dj.sandbox (experiment management)
Supports closed-loop optimization based on validation tasks
This interactive recipe can currently be viewed and tried out here.
Other Directions#
Data Processing Agent Benchmarking: Quantify the performance of different Agents in terms of accuracy, efficiency, and robustness
Data "Health Check Report" & Data Intelligent Recommendation: Automatically diagnose data problems and recommend optimization solutions
Router Agent Enhancement: More seamless, e.g., when operators are lacking → Code Development Agent → Data Processing Agent
MCP Further Optimization: Embedded LLM, users can directly use MCP connected to their local environment (e.g., IDE) to get an experience similar to current data processing agents
Knowledge Base and RAG-oriented Data Agents
Better Automatic Processing Solution Generation: Less token usage, more efficient, higher quality processing results
Data Workflow Template Reuse and Automatic Tuning: Based on DataJuicer community data recipes
...
Common Issues#
Q: How do I get a DashScope API key?
A: Visit the official DashScope website, register an account, and apply for an API key.
Q: Why does operator retrieval fail?
A: Check your network connection and API key configuration, or try switching to vector retrieval mode.
Q: How do I debug custom operators?
A: Make sure the Data-Juicer path is configured correctly, and check the example code provided by the code development agent.
Q: What should I do if the MCP service connection fails?
A: Check that the MCP server is running and that the URL in the configuration file is correct.
Q: Error: requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://localhost:3000/trpc/pushMessage
A: Agents handle data via file references (paths) rather than direct uploads. Check whether any non-text files were submitted.
Optimization Recommendations#
For large-scale data processing, use DataJuicer's distributed mode
Set the batch size appropriately to balance memory usage and processing speed
For more advanced data processing features (synthesis, Data-Model Co-Development), refer to the DataJuicer documentation
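As a rough illustration of the batch size trade-off, the sketch below derives a starting value from a memory budget. It is a generic back-of-the-envelope heuristic, not a Data-Juicer API: the function `suggest_batch_size`, the `overhead_factor` of 3, and the size limits are all assumptions, and the result should be treated as a first guess to tune empirically.

```python
def suggest_batch_size(memory_budget_bytes, avg_sample_bytes, num_workers,
                       overhead_factor=3.0, min_size=1, max_size=10_000):
    """Estimate a per-worker batch size that fits within a memory budget.

    overhead_factor is an assumed multiplier for intermediate copies and
    per-sample metadata created while a batch is being processed.
    """
    if avg_sample_bytes <= 0 or num_workers <= 0:
        raise ValueError("avg_sample_bytes and num_workers must be positive")
    per_worker_budget = memory_budget_bytes / num_workers
    estimate = int(per_worker_budget // (avg_sample_bytes * overhead_factor))
    # Clamp so that tiny budgets still make progress and huge ones stay bounded.
    return max(min_size, min(max_size, estimate))


if __name__ == "__main__":
    # 4 GiB budget, 50 KiB average sample, 8 workers.
    print(suggest_batch_size(4 * 1024**3, 50 * 1024, 8))
```

Larger batches amortize per-call overhead but raise peak memory per worker, so the estimate shrinks as either the worker count or the sample size grows.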