data_juicer.utils.jsonl_lenient_loader module

Stream local JSONL with stdlib json.loads(), skipping bad lines.

Used when HuggingFace’s JSON builder (ujson) fails on some rows or when you need per-line fault tolerance. Output is a normal datasets.Dataset, so downstream operators behave the same as with the default JSONL loader.
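The per-line fault tolerance described above can be illustrated with a minimal standard-library sketch. It is not the module's actual implementation: `parse_lenient` is a hypothetical helper, and skipping blank lines and non-object JSON values (such as arrays) is an assumption based on the "one dict per valid JSON object line" wording.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def parse_lenient(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: yield a dict for each line that parses as a JSON object."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines carry no record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line: skip it instead of failing the whole load
        if isinstance(record, dict):  # assumption: non-object values are skipped too
            yield record


# One truncated line, one blank line and one JSON array among two valid records.
sample = ['{"text": "ok"}', '{"text": broken', "", "[1, 2]", '{"text": "also ok"}']
records = list(parse_lenient(sample))
print(records)
```

Only the two well-formed object lines survive; a strict loader would typically abort on the first malformed row.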

data_juicer.utils.jsonl_lenient_loader.iter_lenient_jsonl_records(file_ext_pairs: List[Tuple[str, str]], *, add_suffix_column: bool) → Iterator[Dict[str, Any]]

Yield one dict per valid JSON object line.

Parameters:
  • file_ext_pairs – list of (file_path, ext_key) pairs, where ext_key is the suffix key from find_files_with_suffix (e.g. ".jsonl").

  • add_suffix_column – if True, set the Fields.suffix column on each record to "." + ext_key.strip("."), matching what the default HF loader produces.
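As a rough sketch of how these two parameters might be consumed, the generator below walks (file_path, ext_key) pairs and optionally tags each record with a normalized suffix. The names `iter_records_sketch` and `SUFFIX_COLUMN` are hypothetical, and the string "suffix" is only a placeholder for the real column name held in Fields.suffix.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterator, List, Tuple

SUFFIX_COLUMN = "suffix"  # placeholder; the real column name comes from Fields.suffix


def iter_records_sketch(
    file_ext_pairs: List[Tuple[str, str]], *, add_suffix_column: bool
) -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for the documented generator, stdlib only."""
    for file_path, ext_key in file_ext_pairs:
        # Normalize the key so that "jsonl" and ".jsonl" both give ".jsonl".
        suffix = "." + ext_key.strip(".")
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # bad or blank line: skip it
                if not isinstance(record, dict):
                    continue  # assumption: only JSON objects become records
                if add_suffix_column:
                    record[SUFFIX_COLUMN] = suffix
                yield record


# Demo on a temporary file containing one malformed line.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"text": "a"}\n{"text": \n{"text": "b"}\n')
    rows = list(iter_records_sketch([(path, "jsonl")], add_suffix_column=True))

print(rows)
```

Making `add_suffix_column` keyword-only, as in the documented signature, keeps call sites explicit about whether the extra column is wanted.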

data_juicer.utils.jsonl_lenient_loader.dataset_from_lenient_jsonl_files(file_ext_pairs: List[Tuple[str, str]], *, add_suffix_column: bool) → Dataset

Build a datasets.Dataset by streaming all given JSONL files.