# Postprocess tools

This folder contains several postprocess scripts for additional processing of datasets that have already been processed by Data-Juicer.

## Usage

### Count tokens for datasets

Use `count_token.py` to count tokens in datasets.

```shell
python tools/postprocess/count_token.py        \
    --data_path            <data_path>         \
    --text_keys            <text_keys>         \
    --tokenizer_method     <tokenizer_method>  \
    --num_proc             <num_proc>

# get help
python tools/postprocess/count_token.py --help
```
- `data_path`: path to the input dataset. Only jsonl is supported for now.
- `text_keys`: field keys whose contents will be included in the token counts.
- `tokenizer_method`: name of the Hugging Face tokenizer.
- `num_proc`: number of processes for counting tokens.
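
Conceptually, the script tokenizes each configured text field of every sample and sums the token counts. The sketch below illustrates this logic under illustrative assumptions (a hypothetical `demo-dataset.jsonl` file, a single `text` field, and the `gpt2` tokenizer); it is not the script's actual implementation.

```python
import json

from transformers import AutoTokenizer  # requires the `transformers` package

# illustrative tokenizer; the script takes yours via --tokenizer_method
tokenizer = AutoTokenizer.from_pretrained("gpt2")

total = 0
with open("demo-dataset.jsonl") as fin:  # hypothetical input file
    for line in fin:
        sample = json.loads(line)
        for key in ["text"]:  # the fields given by --text_keys
            total += len(tokenizer.tokenize(sample[key]))

print(f"Total tokens: {total}")
```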

### Mix multiple datasets with optional weights

Use `data_mixture.py` to mix multiple datasets.

This script randomly selects samples from each input dataset according to its weight, mixes the selected samples, and exports them as a new dataset.

```shell
python tools/postprocess/data_mixture.py        \
    --data_path             <data_path>         \
    --export_path           <export_path>       \
    --export_shard_size     <export_shard_size> \
    --num_proc              <num_proc>

# get help
python tools/postprocess/data_mixture.py --help
```
- `data_path`: a dataset file, a list of dataset files, or a mixture of both, each prefixed by an optional weight; if a weight is not given, it defaults to 1.0.
- `export_path`: file name for exporting the mixed dataset; json / jsonl / parquet formats are supported.
- `export_shard_size`: size of each exported dataset shard file in bytes. If not set, the mixed dataset will be exported as a single file.
- `num_proc`: number of processes for loading and exporting datasets.
- e.g. `python tools/postprocess/data_mixture.py --data_path <w1> ds.jsonl <w2> ds_dir <w3> ds_file.json`

Note: all datasets must have the same meta fields, so that Hugging Face Datasets can be used to align their features.
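
The weighted selection can be pictured with Hugging Face `datasets`, as in the minimal sketch below. File names and weights are illustrative, weights above 1.0 (which would repeat samples) are omitted for brevity, and this is not the script's actual implementation.

```python
from datasets import concatenate_datasets, load_dataset  # requires `datasets`

specs = [("ds1.jsonl", 1.0), ("ds2.jsonl", 0.5)]  # hypothetical (path, weight) pairs

parts = []
for path, weight in specs:
    ds = load_dataset("json", data_files=path, split="train")
    # a weight w <= 1.0 keeps roughly w * len(ds) randomly selected samples
    num_samples = min(int(len(ds) * weight), len(ds))
    parts.append(ds.shuffle(seed=42).select(range(num_samples)))

mixed = concatenate_datasets(parts).shuffle(seed=42)
mixed.to_json("mixed.jsonl")  # hypothetical export path
```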

### Deserialize meta fields in jsonl file

This tool is usually used together with `serialize_meta.py` to deserialize the specified field back into its original format.

```shell
python tools/postprocess/deserialize_meta.py    \
    --src_dir           <src_dir>         \
    --target_dir        <target_dir>      \
    --serialized_key    <serialized_key>  \
    --num_proc          <num_proc>

# get help
python tools/postprocess/deserialize_meta.py --help
```
- `src_dir`: path to the directory storing the jsonl files.
- `target_dir`: path for saving the converted jsonl files.
- `serialized_key`: the key of the field to be deserialized. Defaults to 'source_info'.
- `num_proc` (optional): number of process workers. Defaults to 1.

Note: after deserialization, all fields that were serialized in the original file will be placed under `serialized_key`. This ensures that fields generated by Data-Juicer processing will not conflict with the original meta fields.
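
Assuming the fields were serialized into JSON strings (the format typically produced by `serialize_meta.py`), deserialization is essentially a JSON round-trip per sample. The sketch below illustrates this with hypothetical file paths; it is not the script's actual implementation.

```python
import json

serialized_key = "source_info"  # the default --serialized_key

with open("src/part-0.jsonl") as fin, open("target/part-0.jsonl", "w") as fout:
    for line in fin:
        sample = json.loads(line)
        # turn the serialized JSON string back into its original structure
        sample[serialized_key] = json.loads(sample[serialized_key])
        fout.write(json.dumps(sample, ensure_ascii=False) + "\n")
```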