data_juicer_agents.utils.dj_config_bridge module

Bridge to Data-Juicer’s native configuration system.

This module provides a dynamic bridge to Data-Juicer’s configuration, eliminating the need to manually sync schema definitions.

Public API:

get_dj_config_bridge() → singleton DJConfigBridge instance
coerce_fields() → type-coerce dict values via DJ parser hints

Field classification lists:

dataset_fields → dataset I/O and binding fields
system_fields → runtime/executor system fields
agent_managed_fields → fields auto-set by the agent (not by LLM)

class data_juicer_agents.utils.dj_config_bridge.DJConfigBridge

Bases: object

Bridge to Data-Juicer’s native configuration and validation.

All DJ-dependent logic is centralised here. Callers should obtain the singleton via get_dj_config_bridge() and call methods on it.

__init__()

property parser

Lazily load the Data-Juicer base parser (no OPs registered).

get_default_config() → Dict[str, Any]

Return all parser fields with their default values (cached).

extract_system_config(config: Dict[str, Any] | None = None) → Dict[str, Any]

Extract system-related fields based on the explicit system_fields list.

extract_dataset_config(config: Dict[str, Any] | None = None) → Dict[str, Any]

Extract dataset-related fields.

extract_agent_managed_config(config: Dict[str, Any] | None = None) → Dict[str, Any]

Extract agent-managed fields (auto-set by agent, not by LLM).

These fields (e.g. project_name) are programmatically set during the apply phase and should not be exposed to the LLM for configuration.
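The three extract_* methods above all select a subset of keys from a config dict. A minimal, self-contained sketch of that idea follows. It does not import the real module: the list contents (other than project_name), the helper name extract_by_fields, and the choice to treat a missing config as empty are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, Optional

# Hypothetical classification lists; the real module defines its own.
SYSTEM_FIELDS = ["np", "executor_type"]
DATASET_FIELDS = ["dataset_path", "export_path"]
AGENT_MANAGED_FIELDS = ["project_name"]


def extract_by_fields(
    config: Optional[Dict[str, Any]], fields: Iterable[str]
) -> Dict[str, Any]:
    """Return only the keys of `config` that appear in `fields`.

    Assumption: a missing config is treated as empty here. The real
    methods may fall back to the parser defaults instead.
    """
    config = config or {}
    return {key: config[key] for key in fields if key in config}


config = {
    "project_name": "demo",
    "np": 4,
    "dataset_path": "data/in.jsonl",
    "process": [{"text_length_filter": {"min_len": 10}}],
}

system_part = extract_by_fields(config, SYSTEM_FIELDS)
dataset_part = extract_by_fields(config, DATASET_FIELDS)
agent_part = extract_by_fields(config, AGENT_MANAGED_FIELDS)
```

In this sketch the agent-managed part is kept separate, so it can be withheld from whatever is shown to the LLM and applied programmatically later.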

extract_process_config(config: Dict[str, Any] | None = None) → List[Dict[str, Any]]

Extract the list of process operators.

get_param_descriptions() → Dict[str, str]

Get help text for all parameters from the parser.

validate(config: Dict[str, Any]) → Tuple[bool, List[str]]

Validate a config dict using the DJ base parser.

Checks system/dataset field types and rejects unknown keys. Does NOT validate process list contents or operator params (that is handled by get_op_valid_params in the agents layer).

Parameters:

config – Config dict to validate.

Returns:

(is_valid, error_messages)
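The (is_valid, error_messages) contract can be illustrated with a small stand-in. This is only a sketch of the described behaviour (type checks, rejection of unknown keys, process list left unchecked). The schema KNOWN_FIELDS and the function validate_sketch are hypothetical and do not use the DJ parser.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema: field name -> expected Python type.
KNOWN_FIELDS: Dict[str, type] = {"np": int, "export_path": str, "open_tracer": bool}


def _type_ok(value: Any, expected: type) -> bool:
    # bool is a subclass of int in Python, so reject it explicitly for int fields.
    if expected is int and isinstance(value, bool):
        return False
    return isinstance(value, expected)


def validate_sketch(config: Dict[str, Any]) -> Tuple[bool, List[str]]:
    """Check field types and reject unknown keys; skip the process list."""
    errors: List[str] = []
    for key, value in config.items():
        if key == "process":
            continue  # operator contents are validated elsewhere
        expected = KNOWN_FIELDS.get(key)
        if expected is None:
            errors.append(f"unknown key: {key}")
        elif not _type_ok(value, expected):
            errors.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return (not errors, errors)


ok, messages = validate_sketch({"np": 4, "process": [{"some_op": {}}]})
bad_ok, bad_messages = validate_sketch({"np": "four", "bogus": 1})
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in one pass.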

get_op_valid_params(op_names: set) → Tuple[Dict[str, set], set]

Get valid parameter names for each operator.

Registers the requested operators into a fresh parser, then extracts valid parameter names from the resulting flat actions (e.g. text_length_filter.min_len -> min_len).

Parameters:

op_names – Set of operator names to look up.

Returns:

(op_param_map, known_op_names) where op_param_map is {op_name: {param, ...}} and known_op_names is the full set of registered DJ operators.
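The extraction step (text_length_filter.min_len -> min_len) can be sketched on plain strings. The helper params_from_flat_keys and the sample keys are hypothetical. The real method obtains its flat action names from a freshly built parser, which this sketch does not attempt.

```python
from typing import Dict, Iterable, Set


def params_from_flat_keys(
    flat_keys: Iterable[str], op_names: Set[str]
) -> Dict[str, Set[str]]:
    """Group dotted keys of the form `<op_name>.<param>` by operator name."""
    op_param_map: Dict[str, Set[str]] = {name: set() for name in op_names}
    for key in flat_keys:
        op_name, sep, param = key.partition(".")
        # Keep only dotted keys that belong to a requested operator.
        if sep and param and op_name in op_param_map:
            op_param_map[op_name].add(param)
    return op_param_map


# Hypothetical flat action names, mixing operator params and a top-level field.
flat_keys = [
    "text_length_filter.min_len",
    "text_length_filter.max_len",
    "export_path",
    "other_op.threshold",
]
result = params_from_flat_keys(flat_keys, {"text_length_filter"})
```

A caller could then compare the keys of an operator's params dict against result["text_length_filter"] to flag unsupported parameter names.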

data_juicer_agents.utils.dj_config_bridge.get_dj_config_bridge() → DJConfigBridge

Get the singleton DJConfigBridge instance.
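A module-level singleton accessor combined with a lazily built property is a common way to get this behaviour. The sketch below shows that pattern with a hypothetical stand-in class (BridgeSketch). It is not the actual implementation, and the placeholder object() stands in for the real parser construction.

```python
from typing import Optional


class BridgeSketch:
    """Hypothetical stand-in for DJConfigBridge."""

    def __init__(self) -> None:
        self._parser: Optional[object] = None
        self.parser_builds = 0  # counts how often the parser was constructed

    @property
    def parser(self) -> object:
        # Build the (potentially expensive) parser only on first access.
        if self._parser is None:
            self.parser_builds += 1
            self._parser = object()  # placeholder for the real parser
        return self._parser


_bridge: Optional[BridgeSketch] = None


def get_bridge_sketch() -> BridgeSketch:
    """Return the shared instance, creating it on first call."""
    global _bridge
    if _bridge is None:
        _bridge = BridgeSketch()
    return _bridge


bridge = get_bridge_sketch()
first = bridge.parser
second = bridge.parser
```

Every caller that goes through the accessor shares one instance, so the parser is built at most once per process in this sketch.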

data_juicer_agents.utils.dj_config_bridge.coerce_fields(fields: Dict[str, Any]) → Tuple[Dict[str, Any], List[str]]

Coerce field values to their correct basic Python types via the DJ parser.

Performs safe conversions for basic types (bool, int, float) by inspecting the DJ parser’s registered default-value types. Fields with non-basic target types or fields not registered in the parser are passed through unchanged.

This is used during normalization to ensure values serialise correctly in recipe YAML (e.g. "true" -> True, "4" -> 4).

Parameters:

fields – Dict of config fields to coerce.

Returns:

(coerced_fields, errors) where errors lists human-readable messages for any field that failed type coercion.
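The coercion rule described above (the target type is taken from the registered default, and only bool, int and float are converted) can be sketched without the DJ parser. The DEFAULTS table, the accepted boolean spellings, and the choice to keep the original value when coercion fails are all assumptions of this sketch.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical defaults standing in for the parser's registered defaults.
DEFAULTS: Dict[str, Any] = {
    "open_tracer": False,
    "np": 4,
    "ratio": 0.5,
    "export_path": "out.jsonl",
}

_TRUE = {"true", "1", "yes"}
_FALSE = {"false", "0", "no"}


def _coerce_one(value: Any, target: type) -> Any:
    if type(value) is target:
        return value
    if target is bool:
        text = str(value).strip().lower()
        if text in _TRUE:
            return True
        if text in _FALSE:
            return False
        raise ValueError(f"cannot interpret {value!r} as bool")
    return target(value)  # int("4") -> 4, float("0.25") -> 0.25


def coerce_fields_sketch(fields: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    coerced: Dict[str, Any] = {}
    errors: List[str] = []
    for name, value in fields.items():
        target = type(DEFAULTS.get(name))
        # Unregistered fields and non-basic target types pass through unchanged.
        if name not in DEFAULTS or target not in (bool, int, float):
            coerced[name] = value
            continue
        try:
            coerced[name] = _coerce_one(value, target)
        except (ValueError, TypeError) as exc:
            coerced[name] = value  # assumption: keep the original value on failure
            errors.append(f"{name}: {exc}")
    return coerced, errors


out, errs = coerce_fields_sketch(
    {"open_tracer": "true", "np": "4", "ratio": "0.25", "export_path": 123, "custom": "x"}
)
bad_out, bad_errs = coerce_fields_sketch({"np": "four"})
```

Coercing before serialisation matters because a string such as "true" would otherwise be written to the recipe YAML as a quoted string rather than a boolean.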