data_juicer.utils.datasets_json_compat module

Hugging Face datasets parses JSON and JSON Lines files via pandas, which may call ujson (UltraJSON). UltraJSON rejects some values that CPython's json module accepts, notably very large integers, and raises ValueError: Value is too big!
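As a minimal illustration of the standard-library side of that difference, the sketch below round-trips a record whose integer exceeds the 64-bit range using only the json module. The record and its field names are made up for the example. It does not call ujson, so it only shows that json.loads accepts such a value.

```python
import json

# Hypothetical record; 2**70 does not fit in a 64-bit integer.
big = 2**70
line = json.dumps({"id": big, "text": "sample"})

# json.loads parses the JSON Lines record with the stdlib decoder.
record = json.loads(line)

# Python integers are arbitrary precision, so the value survives intact.
print(record["id"] == big)
```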

Set the environment variable DATA_JUICER_USE_STDLIB_JSON=1 (or true / yes / on) before running dj-process (or any code path that calls init_configs) to force the datasets stack to use json.loads instead.
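One way such a flag check could look, as a sketch only: the helper name env_flag_enabled is hypothetical, and matching the value case-insensitively while ignoring surrounding whitespace is an assumption, not documented behavior. The variable has to be set before the code path that applies the patch runs.

```python
import os

# Values the documentation lists as enabling the flag.
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag_enabled(name: str = "DATA_JUICER_USE_STDLIB_JSON") -> bool:
    """Return True if the environment variable holds an enabling value.

    Hypothetical helper; case-insensitive matching is an assumption.
    """
    value = os.environ.get(name, "")
    return value.strip().lower() in _TRUTHY


# Enable the flag for this process before Data-Juicer initializes.
os.environ["DATA_JUICER_USE_STDLIB_JSON"] = "1"
print(env_flag_enabled())
```

From a shell, the same effect comes from exporting the variable before launching dj-process.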

data_juicer.utils.datasets_json_compat.apply_stdlib_json_patch_for_datasets() → bool

If DATA_JUICER_USE_STDLIB_JSON is enabled, replace datasets.utils.json.ujson_loads with json.loads (bytes-safe).

Note

We patch both datasets.utils.json and datasets.packaged_modules.json.json because the latter imports ujson_loads at module load time (from ... import ujson_loads), which binds the function object directly. Modifying the source module's attribute does not affect already-bound references.
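The binding behavior described in this note can be reproduced without datasets installed. The sketch below builds two throwaway in-memory modules, fake_source and fake_consumer (both hypothetical stand-ins for datasets.utils.json and datasets.packaged_modules.json.json), and shows that rebinding the attribute on the source module alone leaves the consumer's imported name unchanged.

```python
import json
import sys
import types


def strict_loads(data):
    # Stand-in for a parser that rejects the input.
    raise ValueError("Value is too big!")


# Hypothetical source module that defines the loader.
source = types.ModuleType("fake_source")
source.loads = strict_loads
sys.modules["fake_source"] = source

# Hypothetical consumer that binds the loader at import time.
consumer = types.ModuleType("fake_consumer")
exec(
    "from fake_source import loads\n"
    "def parse(data):\n"
    "    return loads(data)\n",
    consumer.__dict__,
)

# Patching only the source module does not reach the consumer's own name.
source.loads = json.loads
try:
    consumer.parse("1")
    patched_via_source = True
except ValueError:
    patched_via_source = False

# Patching the consumer's namespace as well takes effect.
consumer.loads = json.loads
result = consumer.parse("1")

# Remove the throwaway module from the import cache.
del sys.modules["fake_source"]

print(patched_via_source, result)
```

Because `from module import name` copies the reference into the importing module's namespace, each module that imported the function this way needs its own attribute replaced.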

Returns:

Whether the patch was applied in this process.