data_juicer_agents.utils.dj_config_bridge module
Bridge to Data-Juicer’s native configuration system.
This module provides a dynamic bridge to Data-Juicer’s configuration, eliminating the need to manually sync schema definitions.
- Public API:
  - get_dj_config_bridge() → singleton DJConfigBridge instance
  - coerce_fields() → type-coerce dict values via DJ parser hints
- Field classification lists:
  - dataset_fields → dataset I/O and binding fields
  - system_fields → runtime/executor system fields
  - agent_managed_fields → fields managed at agent/tool boundary (not by LLM)
- class data_juicer_agents.utils.dj_config_bridge.DJConfigBridge
Bases: object
Bridge to Data-Juicer’s native configuration and validation.
All DJ-dependent logic is centralised here. Callers should obtain the singleton via get_dj_config_bridge() and call methods on it.
- property parser
Lazy load Data-Juicer base parser (no OPs registered).
- get_default_config() → Dict[str, Any]
Return all parser fields with their default values (cached).
- extract_system_config(config: Dict[str, Any] | None = None) → Dict[str, Any]
Extract system-related fields based on the explicit system_fields list.
- extract_dataset_config(config: Dict[str, Any] | None = None) → Dict[str, Any]
Extract dataset-related fields.
- extract_agent_managed_config(config: Dict[str, Any] | None = None) → Dict[str, Any]
Extract agent-managed fields (auto-set by agent, not by LLM).
These fields (e.g. project_name) are programmatically set during the apply phase and should not be exposed to the LLM for configuration.
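The field-based extractors above can be pictured as a filter over a config dict. The following is a minimal sketch, not the library's implementation: `extract_by_field_list`, `EXAMPLE_SYSTEM_FIELDS`, and the field names in it are hypothetical, and returning an empty dict when `config` is `None` is an assumption, since the behaviour for a missing config is not stated here.

```python
from typing import Any, Dict, List, Optional

# Hypothetical stand-in for the module's `system_fields` list.
EXAMPLE_SYSTEM_FIELDS: List[str] = ["np", "executor_type"]


def extract_by_field_list(
    config: Optional[Dict[str, Any]],
    field_names: List[str],
) -> Dict[str, Any]:
    """Return only the entries of `config` whose keys are in `field_names`."""
    if not config:
        # Assumption: a missing config yields an empty result.
        return {}
    return {key: config[key] for key in field_names if key in config}


example_config = {
    "np": 4,
    "executor_type": "default",
    "project_name": "demo",  # agent-managed, so not part of the system slice
    "process": [],
}
system_part = extract_by_field_list(example_config, EXAMPLE_SYSTEM_FIELDS)
print(system_part)
```

Keeping the field lists explicit means each slice (system, dataset, agent-managed) is defined by membership in a list rather than by naming conventions.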
- extract_process_config(config: Dict[str, Any] | None = None) → List[Dict[str, Any]]
Extract process operator list.
- validate(config: Dict[str, Any]) → Tuple[bool, List[str]]
Validate a config dict using DJ base parser.
Checks system/dataset field types and rejects unknown keys. Does NOT validate process list contents or operator params (that is handled by get_op_valid_params in the agents layer).
- Parameters:
config – Config dict to validate.
- Returns:
(is_valid, error_messages)
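To illustrate the shape of the `(is_valid, error_messages)` result, here is a self-contained sketch of a validator that rejects unknown keys and mismatched types while skipping the process list. It does not call the DJ parser: `validate_sketch` and `EXAMPLE_DEFAULTS` are hypothetical, and inferring the expected type from a default value is an assumption made for illustration.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical defaults standing in for the parser's registered fields.
EXAMPLE_DEFAULTS: Dict[str, Any] = {
    "np": 4,
    "open_tracer": False,
    "project_name": "hello_world",
}


def validate_sketch(
    config: Dict[str, Any],
    defaults: Dict[str, Any],
) -> Tuple[bool, List[str]]:
    """Check keys and value types against `defaults`; ignore `process`."""
    errors: List[str] = []
    for key, value in config.items():
        if key == "process":
            continue  # operator list contents are validated elsewhere
        if key not in defaults:
            errors.append(f"unknown key: {key}")
            continue
        default = defaults[key]
        # Exact type match, so that True is not accepted where an int is expected.
        if default is not None and type(value) is not type(default):
            errors.append(
                f"{key}: expected {type(default).__name__}, "
                f"got {type(value).__name__}"
            )
    return (not errors, errors)


ok, messages = validate_sketch({"np": 8, "process": [{"some_op": {}}]}, EXAMPLE_DEFAULTS)
bad_ok, bad_messages = validate_sketch({"np": "8", "bogus": 1}, EXAMPLE_DEFAULTS)
```

Returning a list of messages rather than raising on the first problem lets a caller report every invalid field in one pass.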
- get_op_valid_params(op_names: set) → Tuple[Dict[str, set], set]
Get valid parameter names for each operator.
Registers the requested operators into a fresh parser, then extracts valid parameter names from the resulting flat actions (e.g. text_length_filter.min_len -> min_len).
- Parameters:
op_names – Set of operator names to look up.
- Returns:
(op_param_map, known_op_names) where op_param_map is {op_name: {param, ...}} and known_op_names is the full set of registered DJ operators.
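The mapping from flat dotted names to per-operator parameter sets can be sketched without the parser. This is an illustrative stand-in only: `params_from_flat_keys` is a hypothetical helper, and the list of flat keys is made up rather than read from Data-Juicer.

```python
from typing import Dict, Iterable, Set


def params_from_flat_keys(
    flat_keys: Iterable[str],
    op_names: Set[str],
) -> Dict[str, Set[str]]:
    """Group `op.param` style keys into {op_name: {param, ...}}."""
    op_param_map: Dict[str, Set[str]] = {name: set() for name in op_names}
    for key in flat_keys:
        # Split on the first dot only: "text_length_filter.min_len" -> op, param.
        op_name, sep, param = key.partition(".")
        if sep and param and op_name in op_param_map:
            op_param_map[op_name].add(param)
    return op_param_map


# Made-up flat keys; a real parser would supply these after OP registration.
flat = [
    "text_length_filter.min_len",
    "text_length_filter.max_len",
    "np",
    "other_op.foo",
]
result = params_from_flat_keys(flat, {"text_length_filter"})
```

Keys without a dot and keys belonging to operators that were not requested are ignored.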
- get_implemented_load_strategies(executor_type: str = 'default') → List[Dict[str, Any]]
Dynamically probe DataLoadStrategyRegistry to find truly implemented load strategies by inspecting source code for NotImplementedError.
This avoids hardcoding a whitelist: when the main library fixes a placeholder strategy, the agent automatically discovers it on the next startup with zero manual maintenance.
- Parameters:
executor_type – Filter by executor type (‘default’, ‘ray’, or ‘*’ for all).
- Returns:
List of dicts with keys executor_type, type, source, config_validation_rules (required_fields, optional_fields).
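The idea of detecting placeholder strategies can be sketched as follows. Note the swap: the method described above inspects source code, whereas this sketch checks the names referenced by a method's compiled code object, so that it also runs where source text is unavailable. The classes, the `load_data` method name, and the dict registry are all hypothetical and are not the DataLoadStrategyRegistry API.

```python
from typing import Dict, List, Type


class WorkingStrategy:
    """Hypothetical strategy with a real implementation."""

    def load_data(self) -> List[str]:
        return ["row"]


class PlaceholderStrategy:
    """Hypothetical strategy that is only a stub."""

    def load_data(self) -> List[str]:
        raise NotImplementedError("not yet supported")


def is_implemented(strategy_cls: Type, method_name: str = "load_data") -> bool:
    """Heuristic: treat a method that references NotImplementedError as a stub."""
    method = getattr(strategy_cls, method_name, None)
    code = getattr(method, "__code__", None)
    if code is None:
        return False  # no such method, or not a plain Python function
    return "NotImplementedError" not in code.co_names


# Hypothetical registry: strategy name -> class.
registry: Dict[str, Type] = {
    "local": WorkingStrategy,
    "remote": PlaceholderStrategy,
}
implemented = sorted(name for name, cls in registry.items() if is_implemented(cls))
print(implemented)
```

Like source inspection, this is a heuristic: a method that raises NotImplementedError on only one branch would also be classified as a placeholder.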
- data_juicer_agents.utils.dj_config_bridge.get_dj_config_bridge() → DJConfigBridge
Get singleton DJConfigBridge instance.
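A singleton accessor combined with a lazily built attribute is commonly written along these lines. This is a generic sketch of the pattern, not the module's code: `ExampleBridge`, `get_example_bridge`, and the placeholder parser object are hypothetical.

```python
from typing import Optional


class ExampleBridge:
    """Hypothetical bridge whose parser is built on first access."""

    def __init__(self) -> None:
        self._parser: Optional[object] = None
        self.parser_builds = 0  # counts how often the parser is constructed

    @property
    def parser(self) -> object:
        # Build once, then reuse the cached instance.
        if self._parser is None:
            self._parser = self._build_parser()
        return self._parser

    def _build_parser(self) -> object:
        self.parser_builds += 1
        return object()  # placeholder for an expensive parser object


_bridge: Optional[ExampleBridge] = None


def get_example_bridge() -> ExampleBridge:
    """Return the process-wide bridge, creating it on first call."""
    global _bridge
    if _bridge is None:
        _bridge = ExampleBridge()
    return _bridge


bridge = get_example_bridge()
first_parser = bridge.parser
second_parser = bridge.parser
```

Deferring parser construction keeps import time low, and sharing one instance means cached results are computed once per process.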
- data_juicer_agents.utils.dj_config_bridge.coerce_fields(fields: Dict[str, Any]) → Tuple[Dict[str, Any], List[str]]
Coerce field values to their correct basic Python types via DJ parser.
Performs safe conversions for basic types (bool, int, float) by inspecting the DJ parser’s registered default-value types. Fields with non-basic target types or fields not registered in the parser are passed through unchanged.
This is used during normalization to ensure values serialise correctly in recipe YAML (e.g. "true" -> True, "4" -> 4).
- Parameters:
fields – Dict of config fields to coerce.
- Returns:
(coerced_fields, errors) where errors lists human-readable messages for any field that failed type coercion.
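The coercion rule described above (target type taken from a default value, basic types only, everything else passed through) can be sketched in isolation. This is an approximation under stated assumptions: `coerce_sketch` and `EXAMPLE_DEFAULTS` are hypothetical, only the strings "true" and "false" are accepted for booleans, and keeping the original value on failure is an assumption.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical defaults standing in for the parser's registered fields.
EXAMPLE_DEFAULTS: Dict[str, Any] = {
    "open_tracer": False,
    "np": 1,
    "ratio": 0.5,
    "project_name": "demo",
}


def coerce_sketch(
    fields: Dict[str, Any],
    defaults: Dict[str, Any],
) -> Tuple[Dict[str, Any], List[str]]:
    """Coerce values to bool/int/float based on the type of each default."""
    coerced: Dict[str, Any] = {}
    errors: List[str] = []
    for key, value in fields.items():
        target = type(defaults[key]) if key in defaults else None
        # Pass through unknown fields, non-basic targets, and already-typed values.
        if target not in (bool, int, float) or type(value) is target:
            coerced[key] = value
            continue
        try:
            if target is bool:
                text = str(value).strip().lower()
                if text not in ("true", "false"):
                    raise ValueError("expected 'true' or 'false'")
                coerced[key] = text == "true"
            else:
                coerced[key] = target(value)
        except (TypeError, ValueError) as exc:
            coerced[key] = value  # assumption: keep the original on failure
            errors.append(f"{key}: cannot coerce {value!r} to {target.__name__} ({exc})")
    return coerced, errors


good, good_errors = coerce_sketch(
    {"open_tracer": "true", "np": "4", "project_name": "x", "unknown": "7"},
    EXAMPLE_DEFAULTS,
)
bad, bad_errors = coerce_sketch({"np": "four"}, EXAMPLE_DEFAULTS)
```

Parsing booleans explicitly matters because `bool("false")` is `True` in Python, which would silently flip the intended value.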