data_juicer.utils.jsonl_lenient_loader module
Stream local JSONL with stdlib json.loads(), skipping bad lines.
Used when HuggingFace's JSON builder (ujson) fails on some rows or when you
need per-line fault tolerance. Output is a normal datasets.Dataset,
so downstream operators behave the same as with the default JSONL loader.
- data_juicer.utils.jsonl_lenient_loader.iter_lenient_jsonl_records(file_ext_pairs: List[Tuple[str, str]], *, add_suffix_column: bool) → Iterator[Dict[str, Any]]
Yield one dict per valid JSON object line.
- Parameters:
file_ext_pairs -- (file_path, ext_key) pairs, where ext_key is the suffix key from find_files_with_suffix (e.g. ".jsonl").
add_suffix_column -- if True, set Fields.suffix to match the default HF loader ("." + ext_key.strip(".")).
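The behaviour described above can be illustrated with a minimal stdlib-only sketch. It is not the module's actual implementation: the function name iter_lenient_sketch, the SUFFIX_COLUMN placeholder (standing in for Fields.suffix), and the choice to silently skip blank lines and non-object JSON values are all assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterator, List, Tuple

# Placeholder column name; the real loader uses Fields.suffix (assumption).
SUFFIX_COLUMN = "suffix"


def iter_lenient_sketch(
    file_ext_pairs: List[Tuple[str, str]], *, add_suffix_column: bool
) -> Iterator[Dict[str, Any]]:
    """Yield one dict per line that parses as a JSON object; skip the rest."""
    for file_path, ext_key in file_ext_pairs:
        # Normalize the suffix as documented: "." + ext_key.strip(".")
        suffix = "." + ext_key.strip(".")
        with open(file_path, "r", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # blank line: nothing to parse
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # malformed line: skip instead of failing the load
                if not isinstance(record, dict):
                    continue  # valid JSON but not an object (assumed to be skipped)
                if add_suffix_column:
                    record[SUFFIX_COLUMN] = suffix
                yield record


def demo() -> List[Dict[str, Any]]:
    """Write a small JSONL file containing bad rows and read it back leniently."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "sample.jsonl")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write('{"text": "ok"}\n')
            handle.write('{"text": broken\n')  # invalid JSON
            handle.write("\n")                 # blank line
            handle.write("[1, 2]\n")           # valid JSON, not an object
            handle.write('{"text": "also ok"}\n')
        return list(iter_lenient_sketch([(path, ".jsonl")], add_suffix_column=True))
```

In this sketch, demo() returns only the two well-formed object rows, each carrying the normalized suffix. A list of records like this could then be turned into a datasets.Dataset, for example with Dataset.from_list, although how the real module builds its dataset is not stated here.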