# Tools Architecture

This document describes the current tool-layer architecture inside `data_juicer_agents`.

## 1. Design Goal

The tool layer is the stable atomic capability surface inside `data_juicer_agents`. It serves three consumers:

- CLI and command surfaces
- the AgentScope-backed `dj-agents` session
- future external skill packaging

The key rule is:

- tool definitions are runtime-agnostic and explicit-input/output
- higher layers must not rely on hidden session defaults or tool-internal state fallback
- runtime adapters may change transport/schema presentation, but not tool semantics

## 2. Core Tool Contracts

Core contracts live in:

- `data_juicer_agents/core/tool/contracts.py`
- `data_juicer_agents/core/tool/registry.py`
- `data_juicer_agents/core/tool/catalog.py`

They define:

- `ToolSpec`
- `ToolContext`
- `ToolResult`
- `ToolRegistry`

Responsibilities:

- describe what a tool is
- define explicit input and output schemas
- register built-in tool specs
- avoid direct dependency on AgentScope, TUI, session state, or CLI rendering

## 3. Tool Groups

`data_juicer_agents/tools/` is organized by tool group. Each group publishes `TOOL_SPECS` through `registry.py`.

Concrete tools usually live under per-tool subdirectories with:

- `input.py`: input model
- `logic.py`: reusable implementation
- `tool.py`: `ToolSpec` binding

Package-level `__init__.py` files re-export stable helpers, and some groups keep shared models or validators in sibling modules such as `_shared/`.
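To make the `input.py` / `logic.py` / `tool.py` split concrete, here is a minimal, self-contained sketch of what one tool could look like. The field names on `ToolSpec`, `ToolContext`, and `ToolResult`, and the `count_lines` tool itself, are assumptions made for illustration; the actual definitions in `core/tool/contracts.py` may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


# --- Hypothetical stand-ins for core/tool/contracts.py (field names assumed) ---
@dataclass(frozen=True)
class ToolContext:
    working_dir: str = "."


@dataclass(frozen=True)
class ToolResult:
    ok: bool
    data: Dict[str, Any] = field(default_factory=dict)
    error: str = ""


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    input_model: type
    handler: Callable[[ToolContext, Any], ToolResult]


# --- input.py: explicit input model, no hidden session defaults ---
@dataclass(frozen=True)
class CountLinesInput:
    text: str


# --- logic.py: reusable implementation with no runtime dependencies ---
def count_lines(text: str) -> int:
    return len(text.splitlines())


# --- tool.py: binds the input model and the logic into a ToolSpec ---
def _run(ctx: ToolContext, args: CountLinesInput) -> ToolResult:
    return ToolResult(ok=True, data={"line_count": count_lines(args.text)})


COUNT_LINES = ToolSpec(
    name="count_lines",
    description="Count the lines in a text value.",
    input_model=CountLinesInput,
    handler=_run,
)

# --- registry.py: the group publishes its specs as TOOL_SPECS ---
TOOL_SPECS = [COUNT_LINES]

result = COUNT_LINES.handler(ToolContext(), CountLinesInput(text="a\nb\nc"))
print(result.data)
```

In this sketch everything the tool needs arrives through its input model, so any caller (CLI, session, or a packaged skill) can invoke it the same way.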
### `tools/context`

- Files:
  - `context/registry.py`
  - `context/inspect_dataset/{input.py,logic.py,tool.py}`
- Main responsibilities:
  - dataset inspection
  - dataset schema probing

### `tools/retrieve`

- Files:
  - `retrieve/registry.py`
  - `retrieve/retrieve_operators/{input.py,logic.py,tool.py}`
  - `retrieve/retrieve_operators/{backend.py,operator_registry.py,catalog.py}`
- Main responsibilities:
  - operator retrieval entrypoints for the main package
  - canonical operator-name resolution
  - installed-operator lookup

### `tools/plan`

- Files:
  - `plan/registry.py`
  - `plan//{input.py,logic.py,tool.py}`
  - `plan/_shared/*.py`
- Main responsibilities:
  - staged dataset/process/system specs and the final plan model
  - deterministic planner core
  - plan validation
  - explicit plan assembly and persistence helpers

### `tools/apply`

- Files:
  - `apply/registry.py`
  - `apply/apply_recipe/{input.py,logic.py,tool.py}`
- Main responsibilities:
  - recipe materialization
  - plan execution
  - structured execution results

### `tools/dev`

- Files:
  - `dev/registry.py`
  - `dev/develop_operator/{input.py,logic.py,tool.py,scaffold.py}`
- Main responsibilities:
  - custom operator scaffold generation
  - optional smoke-check

### `tools/files`

- Files:
  - `files/registry.py`
  - `files/{view_text_file,write_text_file,insert_text_file}/...`
- Main responsibilities:
  - read / write / insert text file helpers

### `tools/process`

- Files:
  - `process/registry.py`
  - `process/{execute_shell_command,execute_python_code}/...`
- Main responsibilities:
  - shell execution
  - python snippet execution

## 4. Runtime Adapters

Runtime-specific adaptation is not placed in the tool groups.
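One way to picture this separation: an adapter wraps an unchanged tool definition and only alters how it is presented to a runtime. The sketch below is not the AgentScope adapter; `SimpleSpec`, `to_runtime_callable`, `preview_args`, and the response shape are hypothetical names used only for illustration.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class SimpleSpec:
    """Hypothetical, runtime-agnostic tool definition."""
    name: str
    input_schema: Dict[str, Any]
    handler: Callable[[Dict[str, Any]], Dict[str, Any]]


def preview_args(args: Dict[str, Any], limit: int = 40) -> str:
    """Render arguments for display, truncating long previews."""
    text = json.dumps(args, sort_keys=True)
    return text if len(text) <= limit else text[:limit] + "..."


def to_runtime_callable(spec: SimpleSpec) -> Callable[..., Dict[str, Any]]:
    """Wrap a spec for a hypothetical runtime without touching its semantics."""
    def call(**kwargs: Any) -> Dict[str, Any]:
        # Only the transport envelope is added here; the handler's output
        # is passed through unchanged.
        return {
            "tool": spec.name,
            "args_preview": preview_args(kwargs),
            "content": spec.handler(kwargs),
        }
    return call


echo = SimpleSpec(
    name="echo_upper",
    input_schema={"type": "object", "properties": {"text": {"type": "string"}}},
    handler=lambda args: {"text": args["text"].upper()},
)

wrapped = to_runtime_callable(echo)
response = wrapped(text="hello")
print(response["content"])
```

Because the wrapper lives outside the tool definition, a different runtime could supply its own envelope while reusing the same `handler`.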
### AgentScope adapter

- `data_juicer_agents/adapters/agentscope/tools.py`
- `data_juicer_agents/adapters/agentscope/schema_utils.py`

Responsibilities:

- convert `ToolSpec` into AgentScope-compatible callable/schema
- normalize JSON schema so agent-facing tool calls stay shallow and explicit
- map `ToolResult` into AgentScope responses
- apply generic argument preview truncation

### Session runtime / toolkit

- `data_juicer_agents/capabilities/session/toolkit.py`
- `data_juicer_agents/capabilities/session/runtime.py`

Responsibilities:

- create the session runtime
- emit tool lifecycle events for TUI/CLI observation
- choose which registered tools are exposed to `DJSessionAgent`
- keep session memory observational only; tool semantics remain explicit

## 5. Default Registry and Session Toolkit

Built-in tool registration is assembled through:

- `data_juicer_agents/core/tool/catalog.py`

That catalog discovers tool groups under `data_juicer_agents/tools/` and loads each group's `TOOL_SPECS` (currently via `registry.py` in every built-in group). It feeds them into:

- `build_default_tool_registry()`

The session toolkit currently uses the default registry directly and orders tools by functional group priority. It does not depend on `session` tags embedded in tool definitions.

## 6. Current Session Tool Set

The default registry currently exposes these tools to the session runtime:

- `inspect_dataset`
- `retrieve_operators`
- `build_dataset_spec`
- `build_process_spec`
- `build_system_spec`
- `validate_dataset_spec`
- `validate_process_spec`
- `validate_system_spec`
- `assemble_plan`
- `plan_validate`
- `plan_save`
- `apply_recipe`
- `develop_operator`
- `view_text_file`
- `write_text_file`
- `insert_text_file`
- `execute_shell_command`
- `execute_python_code`

These tools stay generic. Session orchestration must call them with explicit arguments based on prior tool outputs.

## 7. Boundary Summary

- `core/tool/*` defines tool contracts, discovery, and registry
- `tools//*` defines atomic tools only
- `adapters/agentscope/*` adapts tools to AgentScope transport/schema
- `capabilities/session/*` orchestrates tools conversationally without changing tool semantics

This is the internal shape that future atomic CLI and skill packaging should build on.
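
As a rough illustration of the registration flow described in Section 5, the sketch below builds a registry from per-group spec lists and orders the result by group priority. It is written under assumptions and is not the code in `core/tool/catalog.py`: the in-memory `GROUPS` mapping stands in for module discovery, and `GROUP_PRIORITY`, `build_registry`, `order_for_session`, and the duplicate-name check are hypothetical.

```python
from typing import Dict, List

# Stand-in for discovering each tool group's TOOL_SPECS, reduced to tool names.
GROUPS: Dict[str, List[str]] = {
    "process": ["execute_shell_command", "execute_python_code"],
    "context": ["inspect_dataset"],
    "apply": ["apply_recipe"],
    "retrieve": ["retrieve_operators"],
}

# Hypothetical priority order; the real session toolkit may rank differently.
GROUP_PRIORITY: List[str] = ["context", "retrieve", "apply", "process"]


def build_registry(groups: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each tool name to its group, rejecting duplicate names."""
    registry: Dict[str, str] = {}
    for group, tool_names in groups.items():
        for name in tool_names:
            if name in registry:
                raise ValueError(f"duplicate tool name: {name}")
            registry[name] = group
    return registry


def order_for_session(registry: Dict[str, str]) -> List[str]:
    """Order tools by group priority, keeping registration order within a group."""
    rank = {group: i for i, group in enumerate(GROUP_PRIORITY)}
    # sorted() is stable, so tools within one group keep their original order.
    return sorted(registry, key=lambda name: rank.get(registry[name], len(rank)))


registry = build_registry(GROUPS)
session_tools = order_for_session(registry)
print(session_tools)
```

Note that the ordering here depends only on which group a tool belongs to, not on any tag stored in the tool definition, which mirrors the statement that the session toolkit does not rely on `session` tags.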