快速开始#

1. 环境前置#

Python >=3.10,<3.13
Data-Juicer 运行时（py-data-juicer）
DashScope 或 OpenAI 兼容 API Key

2. 安装#

选择一种安装档位：

core：完整 data_juicer_agents 命令面
harness：面向 djx tool harness profile 的最小安装
full：core 加上 copilot 和 interecipe

cd ./data-juicer-agents
uv venv .venv
source .venv/bin/activate
uv pip install -e '.[core]'

Harness 安装：

uv pip install -e '.[harness]'
export DJX_TOOL_PROFILE=harness

Full 安装：

uv pip install -e '.[full]'

3. 配置模型访问#

export DASHSCOPE_API_KEY="<your_key>"
# 或：
# export MODELSCOPE_API_TOKEN="<your_key>"

# 可选覆盖
export DJA_OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export DJA_SESSION_MODEL="qwen3-max-2026-01-23"
export DJA_PLANNER_MODEL="qwen3-max-2026-01-23"
export DJA_MODEL_FALLBACKS="qwen-max,qwen-plus"
export DJA_LLM_THINKING="true"

4. 最短 CLI 路径#

可选的观察步骤：

djx retrieve "去除重复文本" \
  --dataset ./data/demo-dataset.jsonl \
  --top-k 8

生成 plan：

djx plan "做 RAG 清洗和文本去重" \
  --dataset ./data/demo-dataset.jsonl \
  --export ./data/demo-dataset-processed.jsonl \
  --output ./data/demo-plan.yaml

执行已保存的 plan：

djx apply --plan ./data/demo-plan.yaml --yes

只做 dry-run，不真正执行 dj-process：

djx apply --plan ./data/demo-plan.yaml --yes --dry-run

说明：

djx plan 在构建最终 plan 之前会先做内部算子检索。
djx retrieve 适合用于观察和调试候选算子。

最短原子工具路径：

djx tool list --tag plan
djx tool schema inspect_dataset
djx tool run list_system_config --input-json '{}'

说明：

djx tool 默认是 JSON-only，主要服务 agent / skill 自动化调用。
写入或执行类工具需要显式传 --yes。
DJX_TOOL_PROFILE=harness 会把 djx tool 限制在 harness 工具集（apply、context、retrieve、plan）。

5. 会话模式（`dj-agents`）#

默认 TUI：

dj-agents --dataset ./data/demo-dataset.jsonl --export ./data/demo-dataset-processed.jsonl

纯终端模式：

dj-agents --ui plain --dataset ./data/demo-dataset.jsonl --export ./data/demo-dataset-processed.jsonl

AgentScope Studio 模式：

as_studio
dj-agents --ui as_studio --studio-url http://localhost:3000 --dataset ./data/demo-dataset.jsonl --export ./data/demo-dataset-processed.jsonl

说明：

dj-agents 需要可用的 LLM 访问配置。
会话中按 Ctrl+C 可中断当前轮，按 Ctrl+D 退出。
as_studio 模式需要先单独启动 AgentScope Studio。
会话内常见 planning 链路是 inspect_dataset -> retrieve_operators -> build_dataset_spec -> build_process_spec -> build_system_spec -> assemble_plan -> plan_validate -> plan_save。
如需做算子级发现和 schema 查看，优先使用 retrieve_operators / retrieve_operators_api，再用 get_operator_info。

6. 最小检查#

djx --help
djx retrieve "过滤长文本" --json
djx plan "过滤长文本" --dataset ./data/demo-dataset.jsonl --export ./data/out.jsonl --verbose
djx apply --plan ./data/demo-plan.yaml --yes --dry-run
dj-agents --help

7. 故障排查#

如果 planning 或 session 启动时出现模型 / API 错误，优先检查：

DASHSCOPE_API_KEY 或 MODELSCOPE_API_TOKEN
DJA_OPENAI_BASE_URL
DJA_SESSION_MODEL 和 DJA_PLANNER_MODEL
期望模型兜底时的 DJA_MODEL_FALLBACKS
如果服务端不接受 thinking 参数，检查 DJA_LLM_THINKING

如果命令提示缺少可选依赖，请安装对应档位：

data-juicer-agents[harness]：只使用 harness 版 djx tool
data-juicer-agents[core]：完整 djx / dj-agents 命令集
data-juicer-agents[full]：core + copilot + interecipe