data_juicer_agents.core.tool.dataset_source module
Unified dataset source descriptor.
Replaces the scattered dataset_path / dataset /
generated_dataset_config triple with a single, self-describing envelope
object. The envelope only enforces the exactly-one-of-three constraint;
the inner schema of config and generated is not duplicated here
— it stays in DatasetObjectConfig and
GeneratedDatasetConfig respectively and is validated when the
envelope is converted to a DatasetIOSpec via to_io_spec().
- class data_juicer_agents.core.tool.dataset_source.DatasetSource(*, path: str = '', config: Dict[str, Any] | None = None, generated: Dict[str, Any] | None = None)
Bases: BaseModel
Unified dataset source envelope.
Exactly one of path, config, or generated must be provided. Providing none or more than one raises a validation error.
Examples:
    # Simple local file (shortcut)
    DatasetSource(path="/data/train.jsonl")

    # Structured load config (remote, multi-source, max_sample_num …)
    DatasetSource(config={
        "configs": [
            {"type": "local", "path": "/data/a.jsonl", "weight": 0.7},
            {"type": "local", "path": "/data/b.jsonl", "weight": 0.3},
        ],
        "max_sample_num": 50000,
    })

    # Dynamic generation via Data-Juicer FORMATTERS
    DatasetSource(generated={"type": "text_formatter", ...})
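The exactly-one-of-three rule can be illustrated with a small standalone sketch. This is not the library's implementation: `SourceSketch` and `describe` are hypothetical names, a stdlib dataclass stands in for the pydantic `BaseModel`, and treating an empty string, `None`, or an empty dict as "not provided" is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class SourceSketch:
    """Hypothetical stand-in for DatasetSource (not the library's code)."""

    path: str = ""
    config: Optional[Dict[str, Any]] = None
    generated: Optional[Dict[str, Any]] = None

    def __post_init__(self) -> None:
        # Collect the names of the fields that were actually supplied.
        # Assumption: "", None and {} all count as "not provided".
        provided = [
            name
            for name, value in (
                ("path", self.path),
                ("config", self.config),
                ("generated", self.generated),
            )
            if value
        ]
        if len(provided) != 1:
            raise ValueError(
                f"exactly one of path, config, generated is required; got {provided}"
            )


def describe(**kwargs: Any) -> str:
    """Return the name of the populated field, or an error message."""
    try:
        src = SourceSketch(**kwargs)
    except ValueError as exc:
        return f"error: {exc}"
    if src.path:
        return "path"
    return "config" if src.config else "generated"
```

With this sketch, `describe(path="/data/train.jsonl")` reports the `path` field, while `describe()` and `describe(path="a", generated={"type": "text_formatter"})` both report an error, mirroring the zero-or-many cases described above.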
- path: str
- config: Dict[str, Any] | None
- generated: Dict[str, Any] | None
- to_legacy_args() -> Dict[str, Any]
Convert to the legacy (dataset_path, dataset, generated_dataset_config) dict.
Returns a dict with exactly the three legacy keys so callers can unpack with **source.to_legacy_args().
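The unpacking pattern might look like the sketch below. Both functions are hypothetical stand-ins so the example runs without the package installed, and the field mapping (path to dataset_path, config to dataset, generated to generated_dataset_config) is an assumption inferred from the parameter order, not confirmed by the source.

```python
from typing import Any, Dict, Optional


def to_legacy_args_sketch(
    path: str = "",
    config: Optional[Dict[str, Any]] = None,
    generated: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical stand-in for DatasetSource.to_legacy_args()."""
    # Assumed mapping: path -> dataset_path, config -> dataset,
    # generated -> generated_dataset_config.
    return {
        "dataset_path": path,
        "dataset": config,
        "generated_dataset_config": generated,
    }


def legacy_entry_point(
    dataset_path: str = "",
    dataset: Optional[Dict[str, Any]] = None,
    generated_dataset_config: Optional[Dict[str, Any]] = None,
) -> str:
    """Hypothetical legacy caller that still takes the three-parameter form."""
    if dataset_path:
        return f"load file {dataset_path}"
    if dataset is not None:
        return "load structured config"
    return "generate dataset"


# The dict always carries all three keys, so ** unpacking never misses one.
legacy_args = to_legacy_args_sketch(path="/data/train.jsonl")
result = legacy_entry_point(**legacy_args)
```

Because the returned dict always has the same three keys, a legacy function with matching keyword parameters can accept it unchanged.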
- classmethod from_legacy(dataset_path: str = '', dataset: Dict[str, Any] | None = None, generated_dataset_config: Dict[str, Any] | None = None) -> DatasetSource
Create a DatasetSource from the legacy triple.
This is the primary migration bridge: CLI argument parsers and existing callers can keep their three-parameter interface and convert to the unified envelope at the boundary.
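A rough sketch of converting at the CLI boundary follows. The flag names (`--dataset-path`, `--dataset`, `--generated-dataset-config`), JSON-encoded flag values, and the helper `from_legacy_sketch` are all assumptions made for illustration. The helper returns a `(field_name, value)` tuple rather than a real `DatasetSource`, so the example runs standalone.

```python
import argparse
import json
from typing import Any, Dict, List, Optional, Tuple


def from_legacy_sketch(
    dataset_path: str = "",
    dataset: Optional[Dict[str, Any]] = None,
    generated_dataset_config: Optional[Dict[str, Any]] = None,
) -> Tuple[str, Any]:
    """Hypothetical stand-in for DatasetSource.from_legacy().

    Returns (field_name, value) instead of a real DatasetSource.
    """
    candidates = [
        ("path", dataset_path),
        ("config", dataset),
        ("generated", generated_dataset_config),
    ]
    # Assumption: empty values count as "not provided".
    provided = [(name, value) for name, value in candidates if value]
    if len(provided) != 1:
        raise ValueError("exactly one legacy argument must be set")
    return provided[0]


def parse_source(argv: List[str]) -> Tuple[str, Any]:
    """Parse the three legacy flags, then convert once at the boundary."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset-path", default="")
    # Hypothetical: structured arguments are passed as JSON strings.
    parser.add_argument("--dataset", type=json.loads, default=None)
    parser.add_argument("--generated-dataset-config", type=json.loads, default=None)
    args = parser.parse_args(argv)
    return from_legacy_sketch(
        dataset_path=args.dataset_path,
        dataset=args.dataset,
        generated_dataset_config=args.generated_dataset_config,
    )
```

The parser keeps its three-parameter surface, and everything downstream of `parse_source` only sees the single converted value.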
- model_config = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.