🔧 InteRecipe: Interactive Recipe Generation Workflow

Overview

This demo showcases an interactive and progressive workflow for generating data processing recipes using a Data-Juicer Operator Pool. The system enables users and agents to collaboratively build, edit, and validate recipes in a flexible and transparent manner.

Usage

Before running, set the following environment variable:

export DASHSCOPE_API_KEY=your_dashscope_key

Install dependencies:

uv pip install -r requirements.txt

[Optional] Start the copilot server (replace the DATA_JUICER_PATH variable in ../qa-copilot/setup_server.sh with the absolute path to your data-juicer repository):

cd ../qa-copilot
bash setup_server.sh

Launch the demo with streamlit:

streamlit run app.py

InteRecipe’s core functionality and the Q&A Copilot (the Ask AI component) are independent of each other. The Copilot must be deployed separately, but InteRecipe runs without it. For detailed Q&A Copilot configuration, see qa-copilot/README.md.

Operator Pool Usage

See ./playground.ipynb for Operator Pool usage.

✨ Core Feature: Operator Pool

The Operator Pool is a specialized, ordered dictionary-like object that stores all candidate Data-Juicer operators (ops) for data processing.

Each operator in the pool includes:

Basic information: name and description
Status: whether the operator is enabled
Arguments: name, description, type, and current value
Statistics (based on a dataset snapshot): min, max, mean, std, quantiles
Ordering: the position of the operator in the current workflow
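
The fields listed above can be pictured as a small ordered mapping. The following is a minimal sketch only, not InteRecipe's actual implementation: the class names `OpArg`, `OpEntry`, and `OperatorPoolSketch` are hypothetical, and the operator names and argument values are illustrative.

```python
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class OpArg:
    """One argument of an operator (hypothetical structure)."""
    name: str
    description: str
    type: type
    value: Any


@dataclass
class OpEntry:
    """One operator in the pool (hypothetical structure)."""
    name: str
    description: str
    enabled: bool = False
    args: Dict[str, OpArg] = field(default_factory=dict)
    # Statistics from a dataset snapshot (min, max, mean, std, quantiles);
    # None until an analysis has been run.
    stats: Optional[Dict[str, float]] = None


class OperatorPoolSketch(OrderedDict):
    """Ordered mapping of operator name -> OpEntry.

    Insertion order stands in for the operator's position in the workflow.
    """

    def position(self, op_name: str) -> int:
        return list(self.keys()).index(op_name)


pool = OperatorPoolSketch()
pool["text_length_filter"] = OpEntry(
    name="text_length_filter",
    description="Keep samples whose text length is within a range.",
    enabled=True,
    args={
        "min_len": OpArg("min_len", "Minimum text length", int, 10),
        "max_len": OpArg("max_len", "Maximum text length", int, 10000),
    },
)
pool["language_id_score_filter"] = OpEntry(
    name="language_id_score_filter",
    description="Keep samples in a given language above a confidence score.",
)

print(pool.position("language_id_score_filter"))
```

Using an ordered mapping keeps name-based lookup and workflow position in one structure, which matches the "ordered dictionary-like" description.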

📊 Visualization & Interaction

The full state of the operator pool is visualized to provide users with a clear and editable overview.
The LLM agent leverages this state to suggest modifications or improvements to the data recipe.

🛠️ Supported Actions

Both the user and the LLM agent can take the following actions:

Enable or disable an operator
Modify argument values of an operator
Change the execution order of operators

Each unique configuration of the operator pool corresponds to a distinct data processing recipe.
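
As a rough illustration of how the three actions map a pool state to a recipe, the sketch below uses a plain `OrderedDict`. The helper names (`set_enabled`, `set_arg`, `move_to_front`, `to_process_list`) are hypothetical and not part of InteRecipe; the output shape assumes a Data-Juicer style `process` list of `{op_name: args}` entries.

```python
from collections import OrderedDict

# Hypothetical pool state: operator name -> {"enabled": bool, "args": dict}
pool = OrderedDict(
    [
        ("text_length_filter", {"enabled": False, "args": {"min_len": 10, "max_len": 10000}}),
        ("words_num_filter", {"enabled": True, "args": {"min_num": 10}}),
    ]
)


def set_enabled(pool, op_name, enabled):
    """Enable or disable an operator."""
    pool[op_name]["enabled"] = enabled


def set_arg(pool, op_name, arg_name, value):
    """Modify one argument value; unknown arguments are rejected."""
    if arg_name not in pool[op_name]["args"]:
        raise KeyError(f"{op_name} has no argument {arg_name!r}")
    pool[op_name]["args"][arg_name] = value


def move_to_front(pool, op_name):
    """Change execution order by moving an operator to the first position."""
    pool.move_to_end(op_name, last=False)


def to_process_list(pool):
    """Enabled operators, in pool order, as a list of {op_name: args} entries."""
    return [{name: dict(op["args"])} for name, op in pool.items() if op["enabled"]]


set_enabled(pool, "text_length_filter", True)
set_arg(pool, "text_length_filter", "min_len", 50)
move_to_front(pool, "words_num_filter")

recipe = to_process_list(pool)
print(recipe)
```

Any change to an enabled flag, an argument value, or the order yields a different `process` list, which is the sense in which each pool configuration corresponds to a distinct recipe.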

❓ Why Use an Operator Pool?

Progressive & Interactive Recipe Generation

Recipe construction typically spans multiple stages, such as modality alignment, goal specification, data analysis, and attribution. The operator pool enables fine-grained control and editing at each stage, supporting incremental and iterative development.

Robustness & Validity

Directly asking an LLM to generate a full data recipe in one step often results in invalid outputs. With the operator pool, each modification is validated through strict checks, ensuring recipe integrity and providing feedback when issues arise.
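
To make the idea of validated edits concrete, here is a minimal sketch of a check that either accepts a proposed argument change or returns a feedback message. The `SCHEMA` table and the `validate_edit` function are assumptions made for illustration; the actual checks performed by InteRecipe are not described in this document.

```python
from typing import Any, Tuple

# Hypothetical schema: operator name -> argument name -> (expected type, lower bound)
SCHEMA = {
    "text_length_filter": {"min_len": (int, 0), "max_len": (int, 0)},
}


def validate_edit(op_name: str, arg_name: str, value: Any) -> Tuple[bool, str]:
    """Return (is_valid, feedback) for a proposed argument edit."""
    if op_name not in SCHEMA:
        return False, f"unknown operator: {op_name!r}"
    if arg_name not in SCHEMA[op_name]:
        return False, f"{op_name} has no argument {arg_name!r}"
    expected_type, lower = SCHEMA[op_name][arg_name]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, expected_type):
        return False, (
            f"{arg_name} expects {expected_type.__name__}, "
            f"got {type(value).__name__}"
        )
    if value < lower:
        return False, f"{arg_name} must be >= {lower}, got {value}"
    return True, "ok"


print(validate_edit("text_length_filter", "min_len", 50))
print(validate_edit("text_length_filter", "min_len", "50"))
print(validate_edit("made_up_filter", "min_len", 50))
```

The feedback string is the part an agent can act on: a rejected edit can be returned to the LLM so that it proposes a corrected change instead of producing an invalid recipe.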

Modules

LLM Assistant Module: This module sends the current operator pool status and auxiliary information to the LLM assistant. The user can apply the LLM's suggestions with a single click.
Data Analysis Module: This module uses the dj.analyzer toolkit to perform comprehensive data analysis. The resulting statistics are visualized to help the user edit the operator pool.
[WIP] Data Attribution Module: This module aims to measure the contribution of each operator to validation tasks. The corresponding toolkit is under development as dj.attributor.
[WIP] Sandbox Module: This module uses the dj.sandbox toolkit to enable feedback-driven operator selection and editing through small-scale experiments.