Postprocess tools#
This folder contains postprocess scripts for additional processing of datasets that have already been processed by Data-Juicer.
Usage#
Count tokens for datasets#
Use count_token.py to count tokens in datasets.
python tools/postprocess/count_token.py \
--data_path <data_path> \
--text_keys <text_keys> \
--tokenizer_method <tokenizer_method> \
--num_proc <num_proc>
# get help
python tools/postprocess/count_token.py --help
- data_path: path to the input dataset. Only jsonl is supported for now.
- text_keys: field keys to be included in the token counts.
- tokenizer_method: name of the Hugging Face tokenizer.
- num_proc: number of processes used to count tokens.
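Conceptually, the counting boils down to loading the jsonl file and summing the lengths of the tokenized text fields. The following is a minimal sketch of this idea, not the actual implementation of count_token.py; the tokenizer name "gpt2" and the field name "text" are illustrative placeholders.

# Minimal sketch of token counting; not the actual count_token.py code.
# "gpt2" and "text" below are illustrative placeholders.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

total = 0
with open("processed_dataset.jsonl") as f:
    for line in f:
        sample = json.loads(line)
        for key in ["text"]:  # corresponds to --text_keys
            total += len(tokenizer.tokenize(sample[key]))
print(f"Total tokens: {total}")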
Mix multiple datasets with optional weights#
Use data_mixture.py to mix multiple datasets.
This script randomly selects samples from each input dataset, mixes them, and exports the result as a new dataset.
python tools/postprocess/data_mixture.py \
--data_path <data_path> \
--export_path <export_path> \
--export_shard_size <export_shard_size> \
--num_proc <num_proc>
# get help
python tools/postprocess/data_mixture.py --help
- data_path: a dataset file, a list of dataset files or directories, or a mix of both, each optionally preceded by a weight. If a weight is not set, 1.0 is used by default.
- export_path: file name for exporting the mixed dataset; json / jsonl / parquet formats are supported.
- export_shard_size: size of each exported dataset file in bytes. If not set, the mixed dataset will be exported into a single file.
- num_proc: number of processes used to load and export datasets.

e.g.,
python tools/postprocess/data_mixture.py --data_path <w1> ds.jsonl <w2> ds_dir <w3> ds_file.json
Note: all input datasets must have the same meta fields, so that Hugging Face Datasets can be used to align their features.
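For intuition, the mixing behaves roughly like the sketch below: each dataset contributes a randomly sampled subset whose size is scaled by its weight, and the subsets are concatenated and shuffled. This is a simplified illustration, not the script's actual code; the file names and weights are placeholders.

# Simplified illustration of weighted mixing with Hugging Face Datasets;
# not the actual data_mixture.py code. Paths and weights are placeholders.
from datasets import load_dataset, concatenate_datasets

sources = [("ds1.jsonl", 0.5), ("ds2.jsonl", 1.0)]  # (path, weight) pairs

parts = []
for path, weight in sources:
    ds = load_dataset("json", data_files=path, split="train").shuffle(seed=42)
    num = min(int(len(ds) * weight), len(ds))  # weight scales the sample count
    parts.append(ds.select(range(num)))

# Concatenation requires all datasets to share the same features (meta fields).
mixed = concatenate_datasets(parts).shuffle(seed=42)
mixed.to_json("mixed_dataset.jsonl")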
Deserialize meta fields in jsonl file#
This tool is usually used together with serialize_meta.py to deserialize the specified field back into its original format.
python tools/postprocess/deserialize_meta.py \
--src_dir <src_dir> \
--target_dir <target_dir> \
--serialized_key <serialized_key> \
--num_proc <num_proc>
# get help
python tools/postprocess/deserialize_meta.py --help
- src_dir: path where the jsonl files to convert are stored.
- target_dir: path to save the converted jsonl files.
- serialized_key: key corresponding to the field that will be deserialized. 'source_info' by default.
- num_proc (optional): number of process workers. 1 by default.
Note: after deserialization, all serialized fields in the original file will be placed under serialized_key. This ensures that fields generated by Data-Juicer processing do not conflict with the original meta fields.
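The per-sample transformation is straightforward: the serialized field holds a JSON string, and deserialization parses it back into structured data. The sketch below illustrates the idea only; the actual script additionally walks src_dir, writes results to target_dir, and parallelizes the work over num_proc workers.

# Rough per-sample sketch of deserialization; illustrative only.
import json

def deserialize_sample(sample: dict, serialized_key: str = "source_info") -> dict:
    # The serialized field is stored as a JSON string and is parsed
    # back into its original structured form.
    if isinstance(sample.get(serialized_key), str):
        sample[serialized_key] = json.loads(sample[serialized_key])
    return sample

sample = {"text": "hello", "source_info": "{\"url\": \"https://example.com\"}"}
print(deserialize_sample(sample))
# {'text': 'hello', 'source_info': {'url': 'https://example.com'}}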