Data-Juicer Q&A Copilot Demo#
This document shows end-to-end examples of the Q&A Copilot. Each case lists the user query (input) and the assistant answer (output).
You can chat with our Q&A Copilot, Juicer, on the official Data-Juicer documentation site! Feel free to ask Juicer anything related to the Data-Juicer ecosystem.
Case 1 · Operator Introduction (alphanumeric_filter, EN)#
Type: operator_qa
Language: en
Input (user query)
Introduce alphanumeric_filter
Output (assistant answer)
The alphanumeric_filter is a Data-Juicer operator designed to filter text samples based on the ratio of alphanumeric characters (or tokens) to the total number of characters (or tokens). It helps ensure data quality by retaining only those samples that fall within a specified alphanumeric ratio range.
Key Features:#
Character-based or Token-based: Depending on the tokenization parameter, it can calculate the ratio either by counting characters or by using a Hugging Face tokenizer to count tokens.
Caching Metrics: The computed metric, 'alpha_token_ratio' (if tokenized) or 'alnum_ratio' (if character-based), is cached in the sample's stats field for potential reuse.
Flexible Range Control: You define min_ratio and max_ratio to specify the acceptable range of alphanumeric content.
Parameters:#
tokenization (bool): If True, uses tokenization via a Hugging Face tokenizer; if False, counts characters directly.
min_ratio (float): Minimum acceptable alphanumeric ratio.
max_ratio (float): Maximum acceptable alphanumeric ratio.
Example Usage in Config:#
process:
  - alphanumeric_filter:
      tokenization: False
      min_ratio: 0.3
      max_ratio: 0.9
This configuration keeps samples where 30% to 90% of characters are alphanumeric.
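As a rough illustration of the character-based mode, the sketch below computes an alphanumeric ratio and applies the 0.3 to 0.9 range from the config. It is a minimal stand-in, not Data-Juicer's implementation: the helper names `alnum_ratio` and `keep_sample` are hypothetical, and Python's `str.isalnum` is assumed as the definition of "alphanumeric".

```python
def alnum_ratio(text: str) -> float:
    """Fraction of characters in `text` that are alphanumeric (hypothetical helper)."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def keep_sample(text: str, min_ratio: float = 0.3, max_ratio: float = 0.9) -> bool:
    """Keep the sample only if its ratio falls inside [min_ratio, max_ratio]."""
    return min_ratio <= alnum_ratio(text) <= max_ratio


samples = ["Hello, world! 123", "!!! ??? ...", "abcdef"]
kept = [s for s in samples if keep_sample(s)]
```

Only the first sample survives here: the second has no alphanumeric characters at all, and the third is purely alphanumeric, so its ratio of 1.0 exceeds the upper bound.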
Case 2 · Module Introduction (DJ-Agents, EN)#
Type: submodule_qa
Language: en
Input (user query)
Introduce DJ-Agents
Output (assistant answer)
Data-Juicer Agents (DJ-Agents) is an intelligent, agent-based framework designed to assist users in interactively creating and optimizing Data-Juicer recipes through natural language conversations. It leverages large language models (LLMs) via the DashScope API to interpret user intent, recommend suitable operators, generate configuration code, and even support custom operator development.
Key Features:#
Interactive Recipe Generation
Users can describe their data processing goals in plain language (e.g., "I want to filter out non-English text"), and DJ-Agents will automatically suggest appropriate operators and generate a working recipe.
Custom Operator Development Support
If existing operators don't meet your needs, DJ-Agents can guide you through creating new ones, including scaffolding code templates and updating configurations, especially when paired with a local Data-Juicer source installation.
Flexible Running Modes
Command Line Mode: Default mode for quick interactions.
AgentScope Studio Mode: A visual, interactive web interface for richer debugging and step-by-step guidance (requires installing @agentscope/studio via npm).
Environment Setup
Requires Python 3.10+.
Needs a valid DashScope API key (export as DASHSCOPE_API_KEY).
Optional: Set the DATA_JUICER_PATH environment variable if developing custom operators.
Installation
Uses uv (a fast Python package installer) for streamlined setup:
curl -LsSf https://astral.sh/uv/install.sh | sh
uv pip install -e .
For custom operator development, clone and install Data-Juicer in editable mode:
git clone https://github.com/datajuicer/data-juicer.git
pip install -e /path/to/data-juicer
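The environment requirements listed under Environment Setup could be checked before launching DJ-Agents with a small script like the following. This is only a sketch: `check_environment` is a hypothetical helper, not part of DJ-Agents, and it merely tests the Python version and the two environment variables named in this answer.

```python
import os
import sys


def check_environment(env=None, version=None):
    """Return a list of problems found; an empty list means the checks passed."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version)[:2]
    problems = []
    if version < (3, 10):
        problems.append("Python 3.10+ is required")
    if not env.get("DASHSCOPE_API_KEY"):
        problems.append("DASHSCOPE_API_KEY is not set")
    if not env.get("DATA_JUICER_PATH"):
        # Only needed for custom operator development, so this is just a hint.
        problems.append("DATA_JUICER_PATH is not set (optional)")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Passing `env` and `version` explicitly makes the check easy to exercise without touching the real environment.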
Case 3 · Concept Question (Base Operator, EN)#
Type: concept_qa
Language: en
Input (user query)
Tell me about the base class of operators
Output (assistant answer)
The base class for operators in Data-Juicer is defined in the file base_op.py, located at data_juicer/ops/base_op.py. This class serves as the foundational structure from which all specific operators (such as filters, mappers, etc.) inherit. It provides common functionalities and interfaces required for consistent operator behavior across the Data-Juicer ecosystem.
Typical responsibilities of the base operator class include:
Defining standard method signatures (e.g., process).
Handling configuration and initialization parameters.
Providing utilities for logging, statistics, or error handling.
For implementation details, you can explore the source code directly.
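To make the idea of a shared base class concrete, here is a toy sketch of the pattern. The class names `BaseOp`, `LowercaseMapper`, and `MinLengthFilter` and their methods are hypothetical and far simpler than the real classes in data_juicer/ops/base_op.py; consult the source for the actual interfaces.

```python
from abc import ABC, abstractmethod


class BaseOp(ABC):
    """Hypothetical base operator: stores config and defines a common interface."""

    def __init__(self, text_key: str = "text"):
        self.text_key = text_key  # which field of the sample to operate on

    @abstractmethod
    def process(self, sample: dict):
        """Subclasses implement the per-sample logic."""


class LowercaseMapper(BaseOp):
    """Mapper-style operator: returns a transformed copy of the sample."""

    def process(self, sample: dict) -> dict:
        return {**sample, self.text_key: sample[self.text_key].lower()}


class MinLengthFilter(BaseOp):
    """Filter-style operator: returns True to keep the sample."""

    def __init__(self, min_len: int = 5, text_key: str = "text"):
        super().__init__(text_key)
        self.min_len = min_len

    def process(self, sample: dict) -> bool:
        return len(sample[self.text_key]) >= self.min_len


samples = [{"text": "Hello Data-Juicer"}, {"text": "Hi"}]
mapped = [LowercaseMapper().process(s) for s in samples]
kept = [s for s in mapped if MinLengthFilter(min_len=5).process(s)]
```

Because both operators inherit the same constructor argument and method name, a pipeline can treat them uniformly, which is the consistency the base class is meant to provide.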
Case 4 · Quick Start (Install DJ, EN)#
Type: quick_start_qa
Language: en
Input (user query)
How to install DJ?
Output (assistant answer)
To install Data-Juicer (DJ), follow these steps based on your needs:
Prerequisites#
Python version: >= 3.10 and <= 3.12
Git (if installing from source)
uv (recommended package installer). Install it via:
# Using curl
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or using pip
pip install uv
Basic Installation#
Install the minimal version from PyPI (core data processing only):
uv pip install py-data-juicer
This includes:
Data loading & manipulation
File system operations
Parallel processing
Basic I/O utilities
Scenario-Based Installation#
Install extra components based on your use case:
Core ML/DL#
uv pip install "py-data-juicer[generic]"
Domain-Specific#
# Computer Vision
uv pip install "py-data-juicer[vision]"
# Natural Language Processing
uv pip install "py-data-juicer[nlp]"
# Audio Processing
uv pip install "py-data-juicer[audio]"
Additional Features#
# Distributed Computing
uv pip install "py-data-juicer[distributed]"
# AI Services & APIs
uv pip install "py-data-juicer[ai_services]"
# Development Tools
uv pip install "py-data-juicer[dev]"
Common Installation Patterns#
1. Text Processing#
uv pip install "py-data-juicer[generic,nlp]"
2. Vision Processing#
uv pip install "py-data-juicer[generic,vision]"
3. Full Pipeline#
uv pip install "py-data-juicer[generic,nlp,vision,distributed]"
4. Complete Installation#
uv pip install "py-data-juicer[all]"
Install from Source#
For the latest features or development:
git clone https://github.com/datajuicer/data-juicer.git
cd data-juicer
uv pip install -e .
# Optional: install specific extras
uv pip install -e ".[vision]"
Use -e for editable mode.
OP-Based / Recipe-Based Installation#
For lightweight setups targeting specific operators or recipes:
# Install dependencies for a specific config file
python tools/dj_install.py --config path_to_your_config_file
# OR
dj-install --config path_to_your_config_file
Note: Only available when installing from source.
Docker Installation#
Pull pre-built image:
docker pull datajuicer/data-juicer:<version_tag>
# e.g., docker pull datajuicer/data-juicer:v0.2.0
Or build from source:
docker build -t datajuicer/data-juicer:<version_tag> .
Notes & Troubleshooting#
Verify Installation
import data_juicer as dj
print(dj.__version__)
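If importing the package is undesirable (for example, in a lightweight health check), the installed version can also be read from package metadata. The sketch below uses the standard-library `importlib.metadata`; the helper name `installed_version` is hypothetical, and the distribution name `py-data-juicer` is taken from the install commands in this answer.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "py-data-juicer"):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


print(installed_version() or "py-data-juicer is not installed")
```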
For Video Operators:
Install FFmpeg and ensure it's in your $PATH.
Debian/Ubuntu: sudo apt install ffmpeg
macOS: brew install ffmpeg
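Whether FFmpeg is actually reachable on the PATH can be confirmed from Python with the standard-library `shutil.which`. This is a small sketch; the helper name `has_executable` is hypothetical.

```python
import shutil


def has_executable(name: str) -> bool:
    """True if `name` resolves to an executable on the current PATH."""
    return shutil.which(name) is not None


if not has_executable("ffmpeg"):
    print("ffmpeg not found on PATH; video operators may not work")
```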
Need Help?
Check documentation
Open a GitHub Issue
Join community discussions
Case 5 · Processing Guidance (Dirty Words Filter, EN)#
Type: processing_qa
Language: en
Input (user query)
I want to filter text data with dirty word, tell me how to do it
Output (assistant answer)
To filter text data containing dirty or flagged words, use the flagged_words_filter operator. It calculates the ratio of flagged (e.g., offensive, dirty, or sensitive) words in each text sample and filters out samples that exceed a specified threshold.
How to Use flagged_words_filter#
This operator:
Computes the ratio of flagged words in each text sample.
Keeps only those samples where the flagged-word ratio falls within your defined range (e.g., [0, 0.1] to allow up to 10% flagged words).
Example Configuration in a Recipe (YAML):#
# config.yaml
process:
  - flagged_words_filter:
      lang: "en"
      tokenization: false
      lowercase: true
      flagged_ratio_min: 0
      flagged_ratio_max: 0.05 # Keep samples with <= 5% flagged words
Parameters:#
lang: Language of the text (default: "en").
tokenization: Whether to tokenize text before matching (default: false).
lowercase: Convert text to lowercase for case-insensitive matching (default: true).
flagged_ratio_min/max: Range of acceptable flagged word ratios.
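The ratio check itself can be illustrated with a small, self-contained sketch. It is not the operator's actual code: the placeholder word list, the whitespace tokenization, and the helper names `flagged_ratio` and `keep_sample` are all assumptions made for illustration.

```python
FLAGGED_WORDS = {"darn", "heck"}  # placeholder list, purely illustrative


def flagged_ratio(text: str, flagged=FLAGGED_WORDS) -> float:
    """Share of whitespace-separated words that appear in the flagged set."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in flagged for w in words) / len(words)


def keep_sample(text: str, max_ratio: float = 0.05) -> bool:
    """Keep samples whose flagged-word ratio does not exceed max_ratio."""
    return flagged_ratio(text) <= max_ratio


clean = "The quick brown fox jumps over the lazy dog."
dirty = "Darn, what the heck is this?"
```

With these inputs, `clean` has a ratio of 0 and is kept, while `dirty` has two flagged words out of six and is dropped at the 5% threshold.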
Notes:#
You can customize the list of flagged words by providing your own dictionary file via the words_file parameter.
The default flagged word list covers common profanity and sensitive terms in multiple languages.
Next Step:#
If you need details on how to provide a custom flagged word list or see code examples, I can retrieve the full operator documentation for you.
Case 6 · Best Practices (Using DJ, EN)#
Type: best_practice_qa
Language: en
Input (user query)
Could you give me some best practices for using DJ?
Output (assistant answer)
Here are some best practices for using Data-Juicer (DJ) effectively:
1. Start with Minimal Examples#
Use the Minimal Example Recipes to get familiar with DJ's basic functionality.
These recipes demonstrate core workflows and help you understand configuration structure.
2. Leverage Pre-Built Recipes from DJ-Hub#
Explore the Recipe Gallery for real-world use cases:
Reproduce datasets like RedPajama or BLOOM.
Improve existing datasets (e.g., LLaVA, Alpaca-CoT).
Synthesize multimodal data (e.g., Img-Diff, HumanVBench).
Each recipe includes:
Configuration files
Sample retention rates
Performance benchmarks
Download links (Aliyun, HuggingFace, ModelScope)
Tip: Reuse and adapt these recipes instead of building from scratch.
3. Install Only What You Need#
Use modular installation to avoid bloat:
# For text processing
uv pip install "py-data-juicer[generic,nlp]"
# For vision tasks
uv pip install "py-data-juicer[generic,vision]"
# For full pipeline
uv pip install "py-data-juicer[generic,nlp,vision,distributed]"
For video operators, ensure FFmpeg is installed and in your $PATH.
Reference: Installation Guide
4. Choose the Right Operators#
DJ provides over 180 operators categorized into 8 types:
mapper: Transform samples (e.g., clean text, augment images)
filter: Remove low-quality samples (e.g., by perplexity, image aesthetics)
deduplicator: Remove duplicates (text, image, video)
aggregator, grouper, selector, etc.
Use search_operators() or browse Operators.md to find the right operator for your task.
5. Use the 3-σ Rule for Hyperparameter Tuning#
In many refined recipes (e.g., RedPajama), filter thresholds are set using a 3-sigma statistical rule: keep values within 3 standard deviations of the mean.
This helps retain high-quality samples while removing outliers.
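As a minimal sketch of the 3-sigma idea, the following derives lower and upper thresholds from a list of per-sample statistic values. The function name `three_sigma_bounds` and the example values are hypothetical; in practice the values would come from stats computed over your own dataset.

```python
import statistics


def three_sigma_bounds(values):
    """Return (mean - 3*std, mean + 3*std) using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean - 3 * std, mean + 3 * std


# Hypothetical per-sample stat values, with one obvious outlier at the end.
ratios = [0.70, 0.72, 0.68, 0.71, 0.69] * 4 + [0.05]
low, high = three_sigma_bounds(ratios)
inliers = [r for r in ratios if low <= r <= high]
```

The resulting `low` and `high` could then serve as the minimum and maximum thresholds of a filter. Note that the outlier itself inflates the standard deviation, so the rule works best when outliers are a small share of the data.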
6. Validate with Benchmarks#
After processing, validate performance improvements:
For LLMs: Use downstream task accuracy (e.g., MM-Bench, VQAv2).
For video models: Use VBench scores.
Example: Refined LLaVA pretraining data improved 10 out of 12 evaluation metrics.
7. Automate Dependency Installation#
If installing from source, use:
python tools/dj_install.py --config your_config.yaml
or
dj-install --config your_config.yaml
This automatically installs only the dependencies needed for your config.
8. Use Docker for Reproducibility#
Pull official images:
docker pull datajuicer/data-juicer:<version_tag>
Ensures consistent environments across machines.
9. Contribute Back to the Community#
Submit your own recipes to data-juicer-hub.
Add new operators following the Developer Guide.
10. Monitor Retention Rates & Quality Metrics#
Track how many samples survive each filtering stage.
Compare before/after model performance to quantify impact.
Use selectors (topk_specified_field_selector, etc.) to curate final high-quality subsets.
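Tracking retention can be as simple as counting samples before and after each stage. The sketch below uses plain Python predicates as stand-ins for filters; the function `retention_report` and the stage names are hypothetical and do not correspond to a Data-Juicer API.

```python
def retention_report(samples, stages):
    """Apply (name, predicate) stages in order; return (name, kept, rate) per stage.

    `rate` is the fraction of that stage's input that survived it.
    """
    report = []
    current = list(samples)
    for name, keep in stages:
        survivors = [s for s in current if keep(s)]
        rate = len(survivors) / len(current) if current else 0.0
        report.append((name, len(survivors), rate))
        current = survivors
    return report


texts = ["short", "a reasonably long sentence", "1234567890", "another good sample here"]
stages = [
    ("min_length", lambda t: len(t) >= 10),
    ("has_letters", lambda t: any(ch.isalpha() for ch in t)),
]
report = retention_report(texts, stages)
```

A stage with an unexpectedly low rate is usually the first place to look when a pipeline discards more data than intended.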
By following these practices, you'll maximize efficiency, reproducibility, and effectiveness when using Data-Juicer for your data processing pipelines.
Case 7 · Off-topic Question (Weather, EN)#
Type: off_topic_qa
Language: en
Input (user query)
How's the weather today?
Output (assistant answer)
Sorry, this question is unrelated to Data-Juicer.