data_juicer.format.json_formatter module

class data_juicer.format.json_formatter.JsonFormatter(dataset_path, suffixes=None, **kwargs)

Bases: LocalFormatter

Load json-type files.

Default suffixes are .json and .jsonl, plus their gzip (.gz) and zstd (.zst) compressed variants.

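Optional lenient JSONL mode: set load_jsonl_lenient: true, or set the environment variable DATA_JUICER_JSONL_LENIENT=1. In this mode, jsonl-only inputs are streamed and parsed with the standard library's json.loads(), and bad lines are skipped. This avoids the HF (Hugging Face) ujson parser for those files.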
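The lenient behaviour described above can be illustrated with a minimal sketch. This is not the JsonFormatter implementation; the function name iter_jsonl_lenient is hypothetical, and the sketch assumes an uncompressed, UTF-8 encoded .jsonl file. It only shows the general technique: read line by line, parse each line with json.loads(), and skip lines that fail to parse.

```python
import json
from typing import Any, Iterator


def iter_jsonl_lenient(path: str) -> Iterator[Any]:
    """Yield one parsed value per valid line of a JSONL file.

    Hypothetical helper for illustration only: blank lines and lines
    that are not valid JSON are skipped instead of raising an error.
    """
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                # Blank line: nothing to parse.
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Malformed line: skip it and keep streaming.
                continue
            yield record
```

Because the function is a generator, the file is consumed one line at a time, so a single corrupt record does not abort the whole load.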

SUFFIXES = ['.json', '.jsonl', '.json.gz', '.jsonl.gz', '.json.zst', '.jsonl.zst']

__init__(dataset_path, suffixes=None, **kwargs)

Initialization method.

Parameters:
  • dataset_path -- a dataset file or a dataset directory

  • suffixes -- only files with the specified suffixes are processed

  • kwargs -- extra args
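As a rough illustration of how a suffixes argument can restrict which files are picked up, the sketch below falls back to the documented SUFFIXES list when no suffixes are given. The helper name matches_suffix and the case-insensitive comparison are assumptions made for this example; the actual matching logic inside LocalFormatter may differ.

```python
from typing import List, Optional

# Default list copied from the SUFFIXES attribute documented above.
DEFAULT_SUFFIXES = [
    ".json", ".jsonl", ".json.gz", ".jsonl.gz", ".json.zst", ".jsonl.zst",
]


def matches_suffix(filename: str, suffixes: Optional[List[str]] = None) -> bool:
    """Return True if filename ends with one of the allowed suffixes.

    Hypothetical helper: uses DEFAULT_SUFFIXES when suffixes is None or
    empty, and compares case-insensitively (an assumption of this sketch).
    """
    allowed = suffixes if suffixes else DEFAULT_SUFFIXES
    lowered = filename.lower()
    return any(lowered.endswith(s.lower()) for s in allowed)
```

Matching on the end of the full file name, rather than on the last extension only, is what lets compound suffixes such as .jsonl.gz be recognised.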

load_dataset(num_proc=None, global_cfg=None)

Load a dataset from a dataset file or a dataset directory, and unify its format.

Parameters:
  • num_proc -- number of processes to use when loading the dataset

  • global_cfg -- global config used in subsequent processes

Returns:

the formatted dataset