Tools Architecture

This document describes the current tool-layer architecture inside data_juicer_agents.

1. Design Goal

The tool layer is the stable atomic capability surface inside data_juicer_agents.

It serves three consumers:

  • CLI and command surfaces

  • the AgentScope-backed dj-agents session

  • future external skill packaging

The key rules are:

  • tool definitions are runtime-agnostic, with explicit inputs and outputs

  • higher layers must not rely on hidden session defaults or on fallback to tool-internal state

  • runtime adapters may change transport/schema presentation, but not tool semantics

2. Core Tool Contracts

Core contracts live in:

  • data_juicer_agents/core/tool/contracts.py

  • data_juicer_agents/core/tool/registry.py

  • data_juicer_agents/core/tool/catalog.py

They define:

  • ToolSpec

  • ToolContext

  • ToolResult

  • ToolRegistry

Responsibilities:

  • describe what a tool is

  • define explicit input and output schemas

  • register built-in tool specs

  • avoid direct dependency on AgentScope, TUI, session state, or CLI rendering
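
The four contract types above can be pictured with a minimal sketch. The field names, the handler signature, and the `echo` tool are assumptions made for illustration; the actual definitions in contracts.py and registry.py are not shown in this document and may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ToolContext:
    """Everything a tool may read besides its arguments (fields are hypothetical)."""
    working_dir: str = "."


@dataclass
class ToolResult:
    """Structured outcome of one tool call."""
    ok: bool
    data: Dict[str, Any] = field(default_factory=dict)
    error: Optional[str] = None


@dataclass
class ToolSpec:
    """Describes what a tool is, independent of any runtime."""
    name: str
    description: str
    input_schema: Dict[str, Any]   # JSON-schema-like description of the arguments
    output_schema: Dict[str, Any]  # JSON-schema-like description of ToolResult.data
    handler: Callable[[ToolContext, Dict[str, Any]], ToolResult]


class ToolRegistry:
    """Name-indexed collection of specs; rejects duplicate names."""

    def __init__(self) -> None:
        self._specs: Dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        if spec.name in self._specs:
            raise ValueError(f"duplicate tool name: {spec.name}")
        self._specs[spec.name] = spec

    def get(self, name: str) -> ToolSpec:
        return self._specs[name]

    def names(self) -> List[str]:
        return list(self._specs)


def _echo(ctx: ToolContext, args: Dict[str, Any]) -> ToolResult:
    # Toy handler: every input arrives through `args`; nothing is implicit.
    return ToolResult(ok=True, data={"text": args["text"]})


# Hypothetical example tool, not part of the built-in tool set.
ECHO_SPEC = ToolSpec(
    name="echo",
    description="Return the given text unchanged.",
    input_schema={
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    },
    output_schema={"type": "object", "properties": {"text": {"type": "string"}}},
    handler=_echo,
)
```

Nothing in this sketch imports AgentScope, a TUI, or a CLI renderer, which is the property the responsibilities list asks of the contract layer.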

3. Tool Groups

data_juicer_agents/tools/ is organized by tool group.

Each group publishes TOOL_SPECS through registry.py.

Concrete tools usually live under per-tool subdirectories with:

  • input.py: input model

  • logic.py: reusable implementation

  • tool.py: ToolSpec binding

Package-level __init__.py files re-export stable helpers, and some groups keep shared models or validators in sibling packages such as _shared/.
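
As a rough illustration of the per-tool split, the sketch below collapses the files into one listing, with comments marking where each part would live. The `count_lines` tool, its input fields, and the dict that stands in for ToolSpec are all hypothetical; they only show the division of labour between the input model, the logic, and the binding.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


# --- input.py: the input model (hypothetical tool `count_lines`) ---
@dataclass
class CountLinesInput:
    text: str
    skip_blank: bool = False


# --- logic.py: reusable implementation, callable without any tool runtime ---
def count_lines(params: CountLinesInput) -> Dict[str, Any]:
    lines = params.text.splitlines()
    if params.skip_blank:
        lines = [line for line in lines if line.strip()]
    return {"line_count": len(lines)}


# --- tool.py: binds the model and the logic into a spec-like record ---
# A plain dict stands in for ToolSpec here; the real class is not shown
# in this document.
COUNT_LINES_SPEC: Dict[str, Any] = {
    "name": "count_lines",
    "input_model": CountLinesInput,
    "run": lambda raw: count_lines(CountLinesInput(**raw)),
}

# --- registry.py: the group publishes its specs ---
TOOL_SPECS: List[Dict[str, Any]] = [COUNT_LINES_SPEC]
```

Keeping the logic free of spec and runtime types is what lets a CLI command or a test call `count_lines` directly.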

tools/context

  • Files:

    • context/registry.py

    • context/inspect_dataset/{input.py,logic.py,tool.py}

  • Main responsibilities:

    • dataset inspection

    • dataset schema probing

tools/retrieve

  • Files:

    • retrieve/registry.py

    • retrieve/retrieve_operators/{input.py,logic.py,tool.py}

    • retrieve/retrieve_operators/{backend.py,operator_registry.py,catalog.py}

  • Main responsibilities:

    • operator retrieval entrypoints for the main package

    • canonical operator-name resolution

    • installed-operator lookup

tools/plan

  • Files:

    • plan/registry.py

    • plan/<tool_name>/{input.py,logic.py,tool.py}

    • plan/_shared/*.py

  • Main responsibilities:

    • staged dataset/process/system specs and the final plan model

    • deterministic planner core

    • plan validation

    • explicit plan assembly and persistence helpers

tools/apply

  • Files:

    • apply/registry.py

    • apply/apply_recipe/{input.py,logic.py,tool.py}

  • Main responsibilities:

    • recipe materialization

    • plan execution

    • structured execution results

tools/dev

  • Files:

    • dev/registry.py

    • dev/develop_operator/{input.py,logic.py,tool.py,scaffold.py}

  • Main responsibilities:

    • custom operator scaffold generation

    • optional smoke-check

tools/files

  • Files:

    • files/registry.py

    • files/{view_text_file,write_text_file,insert_text_file}/...

  • Main responsibilities:

    • helpers for reading, writing, and inserting into text files

tools/process

  • Files:

    • process/registry.py

    • process/{execute_shell_command,execute_python_code}/...

  • Main responsibilities:

    • shell execution

    • Python snippet execution

4. Runtime Adapters

Runtime-specific adaptation lives outside the tool groups.

AgentScope adapter

  • data_juicer_agents/adapters/agentscope/tools.py

  • data_juicer_agents/adapters/agentscope/schema_utils.py

Responsibilities:

  • convert a ToolSpec into an AgentScope-compatible callable and schema

  • normalize JSON schema so agent-facing tool calls stay shallow and explicit

  • map ToolResult into AgentScope responses

  • apply generic argument preview truncation
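
Of the adapter duties listed above, argument preview truncation is the easiest to show in isolation. The helper below is a sketch under assumptions: the function name, the `max_chars` parameter, and the suffix format are invented for illustration and are not taken from adapters/agentscope/tools.py. It uses no AgentScope API.

```python
import json
from typing import Any, Dict


def preview_arguments(args: Dict[str, Any], max_chars: int = 80) -> Dict[str, str]:
    """Return a display-only copy of `args` with long values shortened.

    The original `args` dict is never modified, so the tool itself still
    receives the full values.
    """
    preview: Dict[str, str] = {}
    for key, value in args.items():
        # Strings are shown as-is; other values are rendered as JSON.
        text = value if isinstance(value, str) else json.dumps(value, default=str)
        if len(text) > max_chars:
            text = f"{text[:max_chars]}... ({len(text)} chars)"
        preview[key] = text
    return preview
```

Because only the preview is shortened, this kind of truncation changes presentation without touching tool semantics.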

Session runtime / toolkit

  • data_juicer_agents/capabilities/session/toolkit.py

  • data_juicer_agents/capabilities/session/runtime.py

Responsibilities:

  • create the session runtime

  • emit tool lifecycle events for TUI/CLI observation

  • choose which registered tools are exposed to DJSessionAgent

  • keep session memory observational only; tool semantics remain explicit
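
One way to emit lifecycle events while leaving tool semantics untouched is to wrap each tool callable. The sketch below is an assumption about the general pattern, not the code in toolkit.py or runtime.py; the event fields and the `with_lifecycle_events` name are hypothetical.

```python
import time
from typing import Any, Callable, Dict

Event = Dict[str, Any]


def with_lifecycle_events(
    name: str,
    func: Callable[..., Any],
    emit: Callable[[Event], None],
) -> Callable[..., Any]:
    """Wrap `func` so observers see start/end events.

    Arguments and the return value pass through unchanged; the wrapper
    only observes the call.
    """

    def wrapped(**kwargs: Any) -> Any:
        emit({"type": "tool_start", "tool": name, "args": dict(kwargs)})
        started = time.monotonic()
        try:
            result = func(**kwargs)
        except Exception as exc:
            emit({
                "type": "tool_end",
                "tool": name,
                "ok": False,
                "error": str(exc),
                "elapsed_s": time.monotonic() - started,
            })
            raise  # the failure still reaches the caller unchanged
        emit({
            "type": "tool_end",
            "tool": name,
            "ok": True,
            "elapsed_s": time.monotonic() - started,
        })
        return result

    return wrapped
```

A TUI or CLI observer would pass its own `emit` callback, for example `events.append` on a list during testing.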

5. Default Registry and Session Toolkit

Built-in tool registration is assembled through:

  • data_juicer_agents/core/tool/catalog.py

That catalog discovers tool groups under data_juicer_agents/tools/ and loads each group’s TOOL_SPECS (currently via registry.py in every built-in group). It feeds them into:

  • build_default_tool_registry()

The session toolkit currently uses the default registry directly and orders tools by functional group priority. It does not depend on session tags embedded in tool definitions.
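
Ordering by functional group can be done from the group a spec was loaded from, without any tag on the spec itself. The sketch below assumes a priority order and a `names_by_group` mapping purely for illustration; the actual priority used by the session toolkit is not stated in this document.

```python
from typing import Dict, List, Sequence

# Assumed priority; the real order used by the session toolkit may differ.
GROUP_PRIORITY: Sequence[str] = (
    "context", "retrieve", "plan", "apply", "dev", "files", "process",
)


def order_tool_names(
    names_by_group: Dict[str, List[str]],
    priority: Sequence[str] = GROUP_PRIORITY,
) -> List[str]:
    """Flatten group -> tool names into one list ordered by group priority."""
    rank = {group: index for index, group in enumerate(priority)}
    # Unknown groups sort after all known ones, alphabetically, so the
    # output stays deterministic.
    groups = sorted(names_by_group, key=lambda g: (rank.get(g, len(rank)), g))
    ordered: List[str] = []
    for group in groups:
        ordered.extend(names_by_group[group])
    return ordered
```

The group key comes from where a spec was discovered, so tool definitions stay free of session-specific metadata.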

6. Current Session Tool Set

The default registry currently exposes these tools to the session runtime:

  • inspect_dataset

  • retrieve_operators

  • build_dataset_spec

  • build_process_spec

  • build_system_spec

  • validate_dataset_spec

  • validate_process_spec

  • validate_system_spec

  • assemble_plan

  • plan_validate

  • plan_save

  • apply_recipe

  • develop_operator

  • view_text_file

  • write_text_file

  • insert_text_file

  • execute_shell_command

  • execute_python_code

These tools stay generic. Session orchestration must call them with explicit arguments based on prior tool outputs.
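
The explicit-argument rule can be sketched with two stand-in functions. They are deliberately not the real inspect_dataset or build_dataset_spec tools, whose parameters are not documented here; the names, arguments, and return values below are made up to show the calling pattern only.

```python
from typing import Any, Dict, List


def inspect_stub(dataset_path: str, sample_size: int) -> Dict[str, Any]:
    """Stand-in for an inspection tool; returns fixed, made-up findings."""
    return {"dataset_path": dataset_path, "text_keys": ["text"], "sampled": sample_size}


def build_spec_stub(dataset_path: str, text_keys: List[str]) -> Dict[str, Any]:
    """Stand-in for a spec-building tool; it only sees what it is passed."""
    return {"input": dataset_path, "text_keys": list(text_keys)}


def orchestrate(dataset_path: str) -> Dict[str, Any]:
    inspection = inspect_stub(dataset_path=dataset_path, sample_size=20)
    # The second call repeats every value it needs, taken from the first
    # call's output, instead of assuming the tool remembers a
    # "current dataset" from earlier in the session.
    return build_spec_stub(
        dataset_path=inspection["dataset_path"],
        text_keys=inspection["text_keys"],
    )
```

Because each call is fully described by its arguments, the same sequence could be replayed from a CLI or a test without a live session.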

7. Boundary Summary

  • core/tool/* defines tool contracts, discovery, and registry

  • tools/<group>/* defines atomic tools only

  • adapters/agentscope/* adapts tools to AgentScope transport/schema

  • capabilities/session/* orchestrates tools conversationally without changing tool semantics

This is the internal shape that future atomic CLI and skill packaging should build on.