data_juicer.config.config module#

data_juicer.config.config.timing_context(description)[source]#

data_juicer.config.config.load_custom_operators(paths)[source]#: Dynamically load custom operator modules or packages in the specified path.

data_juicer.config.config.init_configs(args: List[str] | None = None, which_entry: object = None, load_configs_only=False)[source]#

Initialize the jsonargparse parser and parse configs from one of:

POSIX-style command-line args;
config files in YAML (a superset of JSON) or jsonnet;
environment variables;
hard-coded defaults.

Parameters:

args – list of params, e.g., ['--config', 'cfg.yaml']; defaults to None.
which_entry – which entry point to initialize configs for (executor/analyzer).
load_configs_only – whether to only load the configs, skipping the config-file backup, the config display, and the logger setup.

Returns:

a global cfg object used by the DefaultExecutor or Analyzer
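
As a rough illustration of the args parameter, the sketch below uses the standard-library argparse as a stand-in for jsonargparse. The helper parse_args_sketch and the --num_workers option are hypothetical and not part of data_juicer; the sketch only shows how a list such as ['--config', 'cfg.yaml'] overrides hard-coded defaults.

```python
import argparse


def parse_args_sketch(args=None):
    """Stand-in for a config parser; not the data_juicer implementation."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default=None)  # path to a config file
    parser.add_argument("--num_workers", type=int, default=4)  # hypothetical option
    # With args=None, argparse reads sys.argv[1:] instead of an explicit list.
    return parser.parse_args(args)


cfg = parse_args_sketch(["--config", "cfg.yaml"])
print(cfg.config, cfg.num_workers)
```

Options missing from the list keep their defaults, which mirrors the "hard-coded defaults" source listed above.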

data_juicer.config.config.update_ds_cache_dir_and_related_vars(new_ds_cache_path)[source]#

data_juicer.config.config.init_setup_from_cfg(cfg: Namespace, load_configs_only=False)[source]#

Do some extra setup tasks after parsing config file or command line.

create working directory and logs directory
update cache directory
update checkpoint and temp_dir of tempfile

Parameters:

cfg – an original cfg

Returns:

an updated cfg

data_juicer.config.config.load_ops_with_stats_meta()[source]#

data_juicer.config.config.update_op_attr(op_list: list, attr_dict: dict = None)[source]#

data_juicer.config.config.sort_op_by_types_and_names(op_name_classes)[source]#

Split op items by op type, sort each group by name, then concatenate the groups.

Parameters: op_name_classes – a list of op modules
Returns: sorted op list, each item is a pair of op_name and op_class
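
The grouping and sorting described here can be sketched as follows. This is a minimal sketch, not the library's code: it assumes the op type can be read from the suffix of the op name, and the type order in OP_TYPES as well as the op names in the example are illustrative assumptions.

```python
# Assumed type order; the real function may use a different set or order.
OP_TYPES = ("mapper", "filter", "deduplicator", "selector")


def sort_ops_sketch(op_name_classes):
    """Group (op_name, op_class) pairs by type suffix, sort each group by name."""
    sorted_ops = []
    for op_type in OP_TYPES:
        group = [(name, cls) for name, cls in op_name_classes if name.endswith(op_type)]
        sorted_ops.extend(sorted(group, key=lambda pair: pair[0]))
    return sorted_ops


# Placeholder classes stand in for real operator classes.
ops = [
    ("text_length_filter", object),
    ("clean_email_mapper", object),
    ("alphanumeric_filter", object),
    ("document_deduplicator", object),
]
sorted_names = [name for name, _ in sort_ops_sketch(ops)]
print(sorted_names)
```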

data_juicer.config.config.update_op_process(cfg, parser, used_ops=None)[source]#

Update operator process configuration with optimized performance.

Parameters:

cfg – Configuration namespace
parser – Argument parser
used_ops – Set of operator names that are actually used in the config

data_juicer.config.config.namespace_to_arg_list(namespace, prefix='', includes=None, excludes=None)[source]#

data_juicer.config.config.save_cli_arguments(cfg: Namespace)[source]#: Save CLI arguments to cli.yaml in the work directory.

data_juicer.config.config.validate_config_for_resumption(cfg: Namespace, work_dir: str, original_args: List[str] = None) → bool[source]#

Validate that the current config matches the job’s saved config for safe resumption.

Does a verbatim comparison between:

1. Original config.yaml + cli.yaml (saved during job creation)
2. Current config (from current command)

Sets cfg._same_yaml_config = True/False for the executor to use.

data_juicer.config.config.config_backup(cfg: Namespace)[source]#

data_juicer.config.config.display_config(cfg: Namespace)[source]#

data_juicer.config.config.export_config(cfg: Namespace, path: str, format: str = 'yaml', skip_none: bool = True, skip_check: bool = True, overwrite: bool = False, multifile: bool = True)[source]#

Save the config object. Some parameters are passed through from jsonargparse.

Parameters:

cfg – cfg object to save (Namespace type)
path – the save path
format – 'yaml', 'json', 'json_indented', 'parser_mode'
skip_none – Whether to exclude entries whose value is None.
skip_check – Whether to skip parser checking.
overwrite – Whether to overwrite existing files.
multifile – Whether to save multiple config files by using the __path__ metas.

Returns:

data_juicer.config.config.merge_config(ori_cfg: Namespace, new_cfg: Namespace)[source]#

Merge the configuration from new_cfg into ori_cfg.

Parameters:

ori_cfg – the original configuration object, expected to be a jsonargparse Namespace
new_cfg – the configuration object to be merged, expected to be a dict or a jsonargparse Namespace

Returns:

cfg_after_merge
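
A minimal sketch of such a merge is shown below. It uses argparse.Namespace as a stand-in for the jsonargparse Namespace and assumes a shallow merge in which values from new_cfg override those in ori_cfg. How the real function treats nested keys, and whether it modifies ori_cfg in place, is not stated here, so merge_config_sketch is only an illustration.

```python
from argparse import Namespace


def merge_config_sketch(ori_cfg, new_cfg):
    """Shallow merge: keys from new_cfg override keys in ori_cfg (assumed behavior)."""
    new_items = new_cfg if isinstance(new_cfg, dict) else vars(new_cfg)
    merged = Namespace(**vars(ori_cfg))  # copy, so ori_cfg stays untouched in this sketch
    for key, value in new_items.items():
        setattr(merged, key, value)
    return merged


# Hypothetical keys, chosen only for the example.
ori = Namespace(num_workers=4, export_path="out.jsonl")
merged = merge_config_sketch(ori, {"num_workers": 8})
print(merged.num_workers, merged.export_path)
```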

data_juicer.config.config.prepare_side_configs(ori_config: str | Namespace | Dict)[source]#

Parse the config if ori_config is a string holding the path to a config file in yaml, yml, or json format.

Parameters: ori_config – a config dict or a string holding the path to a config file in yaml, yml, or json format
Returns: a config dict
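
The dispatch on input type and file extension might look like the sketch below. prepare_side_configs_sketch is a hypothetical re-implementation for illustration. The YAML branch assumes the third-party PyYAML package and is not exercised in the example, and the error handling is an assumption.

```python
import json
import os
import tempfile


def prepare_side_configs_sketch(ori_config):
    """Return a config dict from a dict or from a yaml/yml/json file path."""
    if isinstance(ori_config, dict):
        return ori_config
    if isinstance(ori_config, str):
        ext = os.path.splitext(ori_config)[1].lower()
        if ext == ".json":
            with open(ori_config, encoding="utf-8") as f:
                return json.load(f)
        if ext in (".yaml", ".yml"):
            import yaml  # PyYAML (third-party); only needed for this branch

            with open(ori_config, encoding="utf-8") as f:
                return yaml.safe_load(f)
        raise ValueError(f"Unsupported config file extension: {ext!r}")
    raise TypeError(f"Unsupported config type: {type(ori_config).__name__}")


# Write a small JSON config to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "side.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"text_key": "content"}, f)
    loaded = prepare_side_configs_sketch(path)
print(loaded)
```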

data_juicer.config.config.get_init_configs(cfg: Namespace | Dict, load_configs_only: bool = True)[source]#: Set the init configs of data-juicer for cfg.

data_juicer.config.config.get_default_cfg()[source]#: Get default config values from config_min.yaml

data_juicer.config.config.prepare_cfgs_for_export(cfg)[source]#

data_juicer.config.config.resolve_job_id(cfg)[source]#: Resolve or auto-generate job_id and set it on cfg.
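
A hedged sketch of "resolve or auto-generate" follows. The ID format (timestamp plus a short hexadecimal suffix) is only inferred from the example value 20250804_143022_abc123 that appears under resolve_job_directories; the actual generation scheme is not documented here. A plain dict stands in for cfg, and both helper names are hypothetical.

```python
import uuid
from datetime import datetime


def generate_job_id_sketch():
    """Build an ID shaped like 20250804_143022_abc123 (assumed format)."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    suffix = uuid.uuid4().hex[:6]  # six random hexadecimal characters
    return f"{timestamp}_{suffix}"


def resolve_job_id_sketch(cfg):
    """Keep an existing job_id, otherwise generate one and store it on cfg."""
    if not cfg.get("job_id"):
        cfg["job_id"] = generate_job_id_sketch()
    return cfg


print(resolve_job_id_sketch({"job_id": "my_job"}))
print(resolve_job_id_sketch({}))
```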

data_juicer.config.config.validate_work_dir_config(work_dir: str) → None[source]#

Validate work_dir configuration to ensure {job_id} placement rules are followed.

Parameters: work_dir – The work_dir string to validate
Raises: ValueError – If {job_id} is not at the end of the path
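
The placement rule can be sketched as a simple suffix check. validate_work_dir_sketch is a hypothetical stand-in; the real function may normalize paths or word its error differently.

```python
def validate_work_dir_sketch(work_dir):
    """Raise ValueError if '{job_id}' appears anywhere but the end of work_dir."""
    placeholder = "{job_id}"
    # A trailing slash is tolerated in this sketch (an assumption).
    if placeholder in work_dir and not work_dir.rstrip("/").endswith(placeholder):
        raise ValueError(
            f"'{placeholder}' must be the last part of work_dir, got: {work_dir}"
        )


validate_work_dir_sketch("./outputs/my_project/{job_id}")  # passes
validate_work_dir_sketch("./outputs/my_project")  # passes, no placeholder

try:
    validate_work_dir_sketch("./outputs/{job_id}/results")
except ValueError as err:
    print(err)
```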

data_juicer.config.config.resolve_job_directories(cfg)[source]#

Centralize directory resolution and placeholder substitution. Assumes job_id is already set.

Job Directory Rules:

- If work_dir contains the '{job_id}' placeholder, it MUST be the last part of the path. Examples:

✅ work_dir: "./outputs/my_project/{job_id}" # Valid
✅ work_dir: "/data/experiments/{job_id}" # Valid
❌ work_dir: "./outputs/{job_id}/results" # Invalid - {job_id} not at end
❌ work_dir: "./{job_id}/outputs/data" # Invalid - {job_id} not at end

- If work_dir does NOT contain '{job_id}', job_id will be appended automatically. Example:

work_dir: "./outputs/my_project" → work_dir: "./outputs/my_project/20250804_143022_abc123"

After resolution, work_dir will always include job_id at the end.
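
The substitution and append rules can be sketched as below. resolve_work_dir_sketch is a hypothetical helper that handles only the work_dir string; the real function operates on cfg and may resolve further directories.

```python
def resolve_work_dir_sketch(work_dir, job_id):
    """Substitute '{job_id}' if present, otherwise append job_id to work_dir."""
    placeholder = "{job_id}"
    if placeholder in work_dir:
        return work_dir.replace(placeholder, job_id)
    return work_dir.rstrip("/") + "/" + job_id


job_id = "20250804_143022_abc123"
print(resolve_work_dir_sketch("./outputs/my_project/{job_id}", job_id))
print(resolve_work_dir_sketch("./outputs/my_project", job_id))
```

Both calls produce the same path, which matches the statement that work_dir always ends with job_id after resolution.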
