Dataset Configuration Guide#

This guide gives an overview of how to configure datasets in YAML for the Data-Juicer (DJ) framework. The configurations let you specify local and remote datasets, mix several datasets, and attach data validation rules.

Supported Dataset Formats#

Local Dataset#

The local_json.yaml configuration file shows how to specify datasets stored locally. path is required and points to the local dataset, either a single file or a directory. format is optional: for local files, DJ automatically detects the file format and loads the dataset accordingly. Supported formats include parquet, jsonl, json, csv, tsv, txt, and jsonl.gz. Refer to local_json.yaml for more details.

dataset:
  configs:
    - type: local
      path: path/to/your/local/dataset.json
      format: json

The same structure applies to other formats, for example Parquet:

dataset:
  configs:
    - type: local
      path: path/to/your/local/dataset.parquet
      format: parquet
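
To make the automatic format detection more concrete, here is a minimal sketch of suffix-based detection over the formats listed above. It is an assumption about how such detection could work, not DJ's actual implementation; guess_format is a hypothetical helper.

```python
from typing import Optional

# Formats named in this guide. Matching on the file suffix is an assumption
# made for illustration; DJ's real detection logic may differ.
SUPPORTED_FORMATS = ("jsonl.gz", "parquet", "jsonl", "json", "csv", "tsv", "txt")


def guess_format(path: str) -> Optional[str]:
    """Return the supported format matching the file suffix, or None."""
    name = path.lower()
    # Try longer suffixes first so "x.jsonl.gz" is not mistaken for a shorter match.
    for fmt in sorted(SUPPORTED_FORMATS, key=len, reverse=True):
        if name.endswith("." + fmt):
            return fmt
    return None


if __name__ == "__main__":
    for p in ("data/train.jsonl.gz", "data/train.parquet", "data/blob.bin"):
        print(p, "->", guess_format(p))
```

When detection would be ambiguous, setting format explicitly in the config avoids relying on the suffix at all.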

Remote Huggingface Dataset#

The remote_huggingface.yaml configuration file shows how to load Hugging Face datasets. type and source are fixed to 'remote' and 'huggingface' so that DJ selects the Hugging Face loading logic. path is required and identifies the Hugging Face dataset. name, split and limit are optional: name and split select the dataset name and split, and limit caps the number of samples to load. Refer to remote_huggingface.yaml for more details.

dataset:
  configs:
    - type: 'remote'
      source: 'huggingface'
      path: "HuggingFaceFW/fineweb"
      name: "CC-MAIN-2024-10"
      split: "train"
      limit: 1000
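
As a rough sketch of how such an entry could be interpreted, the snippet below checks the fixed and required keys and separates the arguments that would go to the Hugging Face datasets.load_dataset function (path, name, split) from limit. The helper hf_load_kwargs is hypothetical, and the mapping is an assumption rather than DJ's actual loading code.

```python
from typing import Any, Dict, Optional, Tuple


def hf_load_kwargs(cfg: Dict[str, Any]) -> Tuple[Dict[str, Any], Optional[int]]:
    """Validate one config entry and split it into load arguments and a limit."""
    if cfg.get("type") != "remote" or cfg.get("source") != "huggingface":
        raise ValueError("type/source must be 'remote'/'huggingface'")
    if not cfg.get("path"):
        raise ValueError("path is required")

    kwargs: Dict[str, Any] = {"path": cfg["path"]}
    # name and split are optional; forward them only when they are set.
    for key in ("name", "split"):
        if cfg.get(key) is not None:
            kwargs[key] = cfg[key]
    return kwargs, cfg.get("limit")


if __name__ == "__main__":
    entry = {
        "type": "remote",
        "source": "huggingface",
        "path": "HuggingFaceFW/fineweb",
        "name": "CC-MAIN-2024-10",
        "split": "train",
        "limit": 1000,
    }
    kwargs, limit = hf_load_kwargs(entry)
    print(kwargs, limit)
    # With the `datasets` package installed, one possible next step is:
    #   ds = datasets.load_dataset(**kwargs)
    #   ds = ds.select(range(min(limit, len(ds))))
```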

Remote Arxiv Dataset#

The remote_arxiv.yaml configuration file is used to specify datasets stored remotely in JSON format. type and source are fixed to 'remote' and 'arxiv' so that DJ selects the arXiv loading logic. lang, dump_date, force_download and url_limit are optional and set the dataset language, the dump date, whether to force a download, and the URL limit. Refer to remote_arxiv.yaml for more details.

dataset:
  configs:
    - type: 'remote'
      source: 'arxiv'
      lang: 'en'
      dump_date: 'latest'
      force_download: false
      url_limit: 2

Other Supported Dataset Formats#

Refer to load_strategy.py for more details and supported dataset formats.

Other Features#

Data Mixture#

The mixture.yaml configuration file demonstrates how to specify data mixture rules. DJ mixes the datasets by sampling a portion of each dataset according to its weight. Refer to mixture.yaml for more details.

dataset:
  max_sample_num: 10000
  configs:
    - type: 'local'
      weight: 1.0
      path: 'path/to/json/file'
    - type: 'local'
      weight: 1.0
      path: 'path/to/csv/file'
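
As a rough illustration of weight-based mixing, the sketch below splits max_sample_num across datasets in proportion to their weights and caps each share at the dataset size. This is an assumption about one reasonable scheme, not DJ's actual algorithm; allocate_samples is a hypothetical helper and the dataset sizes are made up.

```python
from typing import List


def allocate_samples(sizes: List[int], weights: List[float], max_sample_num: int) -> List[int]:
    """Split max_sample_num across datasets in proportion to their weights."""
    if len(sizes) != len(weights):
        raise ValueError("sizes and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    counts = []
    for size, weight in zip(sizes, weights):
        share = int(max_sample_num * weight / total_weight)
        # A dataset cannot contribute more samples than it contains.
        counts.append(min(share, size))
    return counts


if __name__ == "__main__":
    # Two datasets with equal weights, as in the config above.
    print(allocate_samples([8000, 3000], [1.0, 1.0], 10000))  # [5000, 3000]
```

In this scheme a small dataset can leave part of the budget unused; redistributing the remainder to larger datasets would be a design choice beyond this sketch.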

Data Validation#

The validator.yaml configuration file demonstrates how to specify data validation rules. DJ will validate the dataset by sampling a portion of the dataset and applying the validation rules. Refer to data_validator.py for more details and supported validators.

dataset:
  configs:
    - type: local
      path: path/to/data.json

validators:
  - type: swift_messages
    min_turns: 2
    max_turns: 20
    sample_size: 1000
  - type: required_fields
    required_fields:
      - "text"
      - "metadata"
      - "language"
    field_types:
      text: "str"
      metadata: "dict"
      language: "str"
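
To show what a required_fields rule amounts to, here is a minimal sketch that samples records and checks field presence and types. It is not the code in data_validator.py; check_required_fields, the TYPE_NAMES mapping and the sample_size and seed parameters are assumptions for illustration.

```python
import random
from typing import Any, Dict, List, Optional

# Mapping from the type names used in the YAML to Python types (assumed).
TYPE_NAMES = {"str": str, "dict": dict, "list": list, "int": int, "float": float, "bool": bool}


def check_required_fields(
    records: List[Dict[str, Any]],
    required_fields: List[str],
    field_types: Optional[Dict[str, str]] = None,
    sample_size: int = 1000,
    seed: int = 0,
) -> List[str]:
    """Return a list of human-readable problems found in a sample of records."""
    field_types = field_types or {}
    # Validate a random sample rather than the whole dataset.
    if len(records) > sample_size:
        sample = random.Random(seed).sample(records, sample_size)
    else:
        sample = records

    errors = []
    for i, rec in enumerate(sample):
        for field in required_fields:
            if field not in rec:
                errors.append(f"record {i}: missing field '{field}'")
        for field, type_name in field_types.items():
            if field in rec and not isinstance(rec[field], TYPE_NAMES[type_name]):
                errors.append(f"record {i}: field '{field}' is not of type {type_name}")
    return errors


if __name__ == "__main__":
    data = [
        {"text": "hello", "metadata": {}, "language": "en"},
        {"text": 1, "metadata": {}},
    ]
    problems = check_required_fields(
        data,
        required_fields=["text", "metadata", "language"],
        field_types={"text": "str", "metadata": "dict", "language": "str"},
    )
    for p in problems:
        print(p)
```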

JSONL per-line fault tolerance (skip bad lines)#

If a few lines are corrupted, or the HuggingFace JSON/ujson path fails to parse them, enable lenient JSONL loading: DJ reads the file line by line with the stdlib json.loads, skips lines that fail to parse (with warnings), and keeps the rest. The result is still a HuggingFace Dataset, so downstream ops behave as they do with normal JSONL.

Enable (either works):

# in the YAML config
load_jsonl_lenient: true

# or as an environment variable
DATA_JUICER_JSONL_LENIENT=1 dj-process --config path/to/config.yaml

Constraints:

  • Only .jsonl / .jsonl.gz / .jsonl.zst shards are read. Other matched files (e.g. .json in the same folder) are skipped with a warning, so the loader does not fall back to HuggingFace/ujson and hit the Value is too big! error described in the next section. Use suffixes: ['.jsonl'] if needed.

  • Intended for DefaultExecutor local JSONL; unrelated to Parquet.

  • Search logs for [lenient jsonl] for skipped lines.
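
The line-by-line behaviour described above can be sketched with the standard library alone. This is an illustration of the technique, not DJ's loader: parse_jsonl_lenient is a hypothetical helper, and the exact wording of the warning is an assumption (only the [lenient jsonl] tag comes from this guide).

```python
import json
import logging
from typing import Any, Iterable, List, Tuple

logger = logging.getLogger("lenient_jsonl")


def parse_jsonl_lenient(lines: Iterable[str]) -> Tuple[List[Any], int]:
    """Parse JSONL lines, skipping those that fail; return (records, skipped)."""
    records: List[Any] = []
    skipped = 0
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are ignored, not counted as errors
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            skipped += 1
            logger.warning("[lenient jsonl] skipping line %d: %s", lineno, exc)
    return records, skipped


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    sample = ['{"text": "ok"}', '{"text": broken', "", '{"text": "also ok"}']
    records, skipped = parse_jsonl_lenient(sample)
    print(len(records), "records kept,", skipped, "line(s) skipped")
```

To use it on a file, pass an open text handle, for example `with open(path, encoding="utf-8") as f: parse_jsonl_lenient(f)`.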

JSON / JSONL load error: Value is too big!#

When loading local JSONL, HuggingFace datasets may parse lines with ujson (via pandas). Very large JSON integers (e.g. long numeric IDs) can exceed what ujson supports and raise ValueError: Value is too big!. This is usually about numeric fields, not necessarily huge strings.

Mitigations:

  1. Preferred (no data rewrite): force stdlib json before running:

    DATA_JUICER_USE_STDLIB_JSON=1 dj-process --config path/to/config.yaml
    
  2. At source: export problematic fields as strings (quoted in JSON).

  3. Other formats: e.g. Parquet, to avoid this JSON code path.
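
For mitigation 2, a small preprocessing pass can turn oversized integers into strings before the data reaches DJ. The sketch below uses the stdlib json module, which parses arbitrarily large integers. The signed 64-bit threshold is an assumption about where the ujson path fails, and stringify_big_ints and rewrite_line are hypothetical helpers.

```python
import json
from typing import Any

# Assumed safe range: signed 64-bit integers.
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def stringify_big_ints(value: Any) -> Any:
    """Recursively replace integers outside the 64-bit range with strings."""
    if isinstance(value, bool):
        return value  # bool is a subclass of int; leave it untouched
    if isinstance(value, int) and not (INT64_MIN <= value <= INT64_MAX):
        return str(value)
    if isinstance(value, dict):
        return {k: stringify_big_ints(v) for k, v in value.items()}
    if isinstance(value, list):
        return [stringify_big_ints(v) for v in value]
    return value


def rewrite_line(line: str) -> str:
    """Re-encode one JSONL line with oversized integers quoted."""
    return json.dumps(stringify_big_ints(json.loads(line)), ensure_ascii=False)


if __name__ == "__main__":
    print(rewrite_line('{"id": 123456789012345678901234567890, "n": 5}'))
```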

Legacy dataset_path Configuration#

The dataset_path configuration is the original way to specify the dataset path. It is simple and easy to use, but lacks flexibility. It can be set in YAML or on the command line. Some examples:

Command line input:

# command line input with a single dataset
dj-process --dataset_path path/to/your/dataset.json

# command line input with weights
dj-process --dataset_path 0.5 path/to/your/dataset1.json 0.5 path/to/your/dataset2.json

YAML input:

dataset_path: path/to/your/dataset.json
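
To clarify how the weighted form reads, the sketch below splits a dataset_path string into (weight, path) pairs. It illustrates the syntax shown above and is not DJ's parser; parse_dataset_path is a hypothetical helper, and the default weight of 1.0 for paths without an explicit weight is an assumption.

```python
from typing import List, Optional, Tuple


def parse_dataset_path(spec: str) -> List[Tuple[float, str]]:
    """Split '[weight] path [weight] path ...' into (weight, path) pairs."""
    pairs: List[Tuple[float, str]] = []
    pending: Optional[float] = None
    for token in spec.split():
        try:
            # A numeric token is taken as the weight of the next path.
            # Note: a path that looks like a number would be misread here.
            pending = float(token)
        except ValueError:
            pairs.append((pending if pending is not None else 1.0, token))
            pending = None
    if pending is not None:
        raise ValueError("weight given without a following path")
    return pairs


if __name__ == "__main__":
    print(parse_dataset_path("path/to/your/dataset.json"))
    print(parse_dataset_path("0.5 path/to/your/dataset1.json 0.5 path/to/your/dataset2.json"))
```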