data_juicer.utils.datasets_json_compat module
HuggingFace datasets parses JSON and JSON Lines files via pandas, which may call
ujson. UltraJSON rejects some values that CPython's json accepts,
notably very large integers, raising "ValueError: Value is too big!".
Set the environment variable DATA_JUICER_USE_STDLIB_JSON=1 (or true /
yes / on) before running dj-process (or any code path that calls
init_configs) to force the datasets stack to use json.loads instead.
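The following is a minimal sketch of the two ideas above, using only the standard library: reading the flag from the environment, and parsing an integer that exceeds 64 bits with json.loads. The helper name stdlib_json_requested and the case-insensitive comparison are assumptions made for illustration; they are not taken from the module's source.

```python
import json
import os

# Accepted values as listed in the description above.
TRUTHY_VALUES = {"1", "true", "yes", "on"}


def stdlib_json_requested(environ=os.environ):
    """Hypothetical helper: report whether the flag is set to a truthy value.

    Lower-casing and stripping the value is an assumption of this sketch.
    """
    value = environ.get("DATA_JUICER_USE_STDLIB_JSON", "")
    return value.strip().lower() in TRUTHY_VALUES


# An integer well beyond the 64-bit range, of the kind the description
# says UltraJSON rejects.
big_value = 2 ** 70
line = '{"id": %d}' % big_value

# CPython's json module parses arbitrary-precision integers.
record = json.loads(line)
```

In a shell, the flag would typically be exported before invoking dj-process, for example `export DATA_JUICER_USE_STDLIB_JSON=1`.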
- data_juicer.utils.datasets_json_compat.apply_stdlib_json_patch_for_datasets() → bool
If DATA_JUICER_USE_STDLIB_JSON is enabled, replace datasets.utils.json.ujson_loads with json.loads (bytes-safe).
Note
We patch both datasets.utils.json and datasets.packaged_modules.json.json because the latter imports ujson_loads at module load time (from ... import ujson_loads), which binds the function object directly. Modifying the source module's attribute does not affect already-bound references.
- Returns:
whether the patch was applied in this process.
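The binding behaviour described in the note can be reproduced without installing datasets. The sketch below uses two stand-in module objects (fake_source and fake_consumer, both hypothetical) in place of datasets.utils.json and datasets.packaged_modules.json.json, and a hypothetical bytes_safe_loads wrapper; it illustrates the mechanism and is not the function's actual implementation.

```python
import json
import types


def strict_loads(data):
    """Stand-in for a parser that rejects very large integers."""
    raise ValueError("Value is too big!")


# Stand-in for the module that defines ujson_loads.
fake_source = types.ModuleType("fake_source")
fake_source.ujson_loads = strict_loads

# Stand-in for a module that ran `from fake_source import ujson_loads`
# at import time: it holds its own reference to the function object.
fake_consumer = types.ModuleType("fake_consumer")
fake_consumer.ujson_loads = fake_source.ujson_loads


def bytes_safe_loads(data):
    """Hypothetical replacement: decode bytes first, then use json.loads."""
    if isinstance(data, (bytes, bytearray)):
        data = data.decode("utf-8")
    return json.loads(data)


# Patching only the source module leaves the consumer's reference untouched.
fake_source.ujson_loads = bytes_safe_loads
consumer_still_strict = fake_consumer.ujson_loads is strict_loads

# Patching the consumer as well makes the replacement take effect there.
fake_consumer.ujson_loads = bytes_safe_loads
payload = ('{"id": %d}' % (2 ** 70)).encode("utf-8")
parsed = fake_consumer.ujson_loads(payload)
```

After the first assignment, consumer_still_strict is True, which is why both attributes need to be replaced.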