data_juicer_agents.core.tool.dataset_source module

Unified dataset source descriptor.

Replaces the scattered dataset_path / dataset / generated_dataset_config triple with a single, self-describing envelope object. The envelope enforces only the exactly-one-of-three constraint. The inner schemas of config and generated are not duplicated here; they stay in DatasetObjectConfig and GeneratedDatasetConfig respectively and are validated when the envelope is converted to a DatasetIOSpec via to_io_spec().

class data_juicer_agents.core.tool.dataset_source.DatasetSource(*, path: str = '', config: Dict[str, Any] | None = None, generated: Dict[str, Any] | None = None)

Bases: BaseModel

Unified dataset source envelope.

Exactly one of path, config, or generated must be provided. Providing zero or more than one raises a validation error.
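
The exactly-one-of rule can be sketched with a stdlib-only stand-in. This is an illustration, not the library's implementation: the real DatasetSource is a pydantic BaseModel, and the name SourceSketch is hypothetical. The sketch also assumes that an empty path and a None dict count as "not provided", which is a guess based on the defaults in the constructor signature.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class SourceSketch:
    """Hypothetical stand-in for DatasetSource (the real class is a pydantic model)."""

    path: str = ""
    config: Optional[Dict[str, Any]] = None
    generated: Optional[Dict[str, Any]] = None

    def __post_init__(self) -> None:
        # Assumption: "" and None mean "not provided".
        provided = [
            name
            for name, is_set in (
                ("path", bool(self.path)),
                ("config", self.config is not None),
                ("generated", self.generated is not None),
            )
            if is_set
        ]
        # Reject both the empty envelope and any combination of sources.
        if len(provided) != 1:
            raise ValueError(
                "exactly one of path/config/generated is required, "
                f"got: {provided or 'none'}"
            )
```

Under these assumptions, SourceSketch(path="/data/train.jsonl") is accepted, while SourceSketch() and SourceSketch(path="/x", config={}) both raise ValueError.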

Examples:

# Simple local file (shortcut)
DatasetSource(path="/data/train.jsonl")

# Structured load config (remote, multi-source, max_sample_num …)
DatasetSource(config={
    "configs": [
        {"type": "local", "path": "/data/a.jsonl", "weight": 0.7},
        {"type": "local", "path": "/data/b.jsonl", "weight": 0.3},
    ],
    "max_sample_num": 50000,
})

# Dynamic generation via Data-Juicer FORMATTERS
DatasetSource(generated={"type": "text_formatter", ...})

path: str
config: Dict[str, Any] | None
generated: Dict[str, Any] | None

to_legacy_args() -> Dict[str, Any]

Convert to the legacy (dataset_path, dataset, generated_dataset_config) dict.

Returns a dict with exactly the three legacy keys so callers can unpack with **source.to_legacy_args().
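
The unpacking pattern might look like the following. Both functions are hypothetical: to_legacy_args_sketch only mirrors the documented return shape, and legacy_load stands in for an existing caller that still takes the triple. The field-to-key mapping (path to dataset_path, config to dataset, generated to generated_dataset_config) is assumed from the order in which the two triples are listed.

```python
from typing import Any, Dict, Optional


def to_legacy_args_sketch(
    path: str = "",
    config: Optional[Dict[str, Any]] = None,
    generated: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Stand-in for to_legacy_args(): always returns the three legacy keys."""
    return {
        "dataset_path": path,
        "dataset": config,
        "generated_dataset_config": generated,
    }


def legacy_load(
    dataset_path: str = "",
    dataset: Optional[Dict[str, Any]] = None,
    generated_dataset_config: Optional[Dict[str, Any]] = None,
) -> str:
    """Hypothetical legacy entry point that still accepts the triple."""
    if dataset_path:
        return f"load file {dataset_path}"
    if dataset is not None:
        return "load structured config"
    return "generate dataset"


# Because the dict always has exactly the three keys, ** unpacking is safe.
legacy_kwargs = to_legacy_args_sketch(path="/data/train.jsonl")
result = legacy_load(**legacy_kwargs)
```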

classmethod from_legacy(dataset_path: str = '', dataset: Dict[str, Any] | None = None, generated_dataset_config: Dict[str, Any] | None = None) -> DatasetSource

Create a DatasetSource from the legacy triple.

This is the primary migration bridge: CLI argument parsers and existing callers can keep their three-parameter interface and convert to the unified envelope at the boundary.
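
A minimal sketch of converting at a CLI boundary, under stated assumptions: the --dataset-path flag is hypothetical (the real CLI flags are not shown on this page), and from_legacy_sketch is a stand-in that returns plain envelope fields instead of a DatasetSource instance.

```python
import argparse
from typing import Any, Dict, Optional, Sequence


def from_legacy_sketch(
    dataset_path: str = "",
    dataset: Optional[Dict[str, Any]] = None,
    generated_dataset_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Stand-in for DatasetSource.from_legacy; returns the envelope fields."""
    return {
        "path": dataset_path,
        "config": dataset,
        "generated": generated_dataset_config,
    }


def parse_source(argv: Sequence[str]) -> Dict[str, Any]:
    """Parse legacy-style CLI arguments, then convert once at the boundary."""
    parser = argparse.ArgumentParser()
    # Hypothetical flag name, used only for illustration.
    parser.add_argument("--dataset-path", default="")
    args = parser.parse_args(argv)
    return from_legacy_sketch(dataset_path=args.dataset_path)


envelope_fields = parse_source(["--dataset-path", "/data/train.jsonl"])
```

With the real class, the return statement in parse_source would presumably become DatasetSource.from_legacy(dataset_path=args.dataset_path), so code past the parser only ever sees the unified envelope.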

model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.