data_juicer.format.json_formatter module
- class data_juicer.format.json_formatter.JsonFormatter(dataset_path, suffixes=None, **kwargs)
Bases: LocalFormatter
Load json-type files.
Default suffixes include .json, .jsonl, and their gzip/zstd variants.
Optional lenient JSONL: setting load_jsonl_lenient: true or the environment variable DATA_JUICER_JSONL_LENIENT=1 streams jsonl-only inputs with the stdlib json.loads(), skipping bad lines (this avoids HF ujson for those files).
- SUFFIXES = ['.json', '.jsonl', '.json.gz', '.jsonl.gz', '.json.zst', '.jsonl.zst']
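The lenient JSONL behavior described above (parse each line with the stdlib json.loads() and skip lines that fail) can be illustrated with a minimal sketch. This is not the data_juicer implementation: the function name iter_jsonl_lenient is hypothetical, and skipping blank lines is an assumption made here for illustration.

```python
import json
from typing import Any, Iterable, Iterator


def iter_jsonl_lenient(lines: Iterable[str]) -> Iterator[Any]:
    """Yield one parsed record per valid JSONL line, skipping bad lines.

    Hypothetical helper for illustration only; it is not part of data_juicer.
    """
    for line in lines:
        line = line.strip()
        if not line:
            # Assumption: blank lines are ignored rather than treated as errors.
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Lenient mode: drop the malformed line and keep streaming.
            continue
        yield record


if __name__ == "__main__":
    sample = ['{"text": "ok"}\n', '{"text": broken\n', '\n', '{"text": "also ok"}\n']
    for rec in iter_jsonl_lenient(sample):
        print(rec)
```

Because the helper accepts any iterable of strings, it can be fed an open text file handle directly, so records are produced one line at a time instead of loading the whole file into memory.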