RedPajama Config Files

This folder in Data-Juicer-Hub contains example configuration files for quickly reproducing the RedPajama processing flow.

Before starting, clone data-juicer-hub to your local machine. It contains many data processing recipes.

git clone https://github.com/datajuicer/data-juicer-hub.git

arXiv

The raw data files can be downloaded from the same AWS link as in Redpajama/arXiv.

Once downloaded, use raw_arxiv_to_jsonl.py to convert from the original format to jsonl that Data-Juicer can handle easily:

python tools/preprocess/raw_arxiv_to_jsonl.py           \
    --arxiv_src_dir       <arxiv_src_dir>    \
    --target_dir          <target_dir>       \
    --temp_dir            <temp_dir>         \
    --num_proc            <num_proc>
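Before pointing the recipe at the converted output, it can help to sanity-check the resulting jsonl files. The sketch below is not part of Data-Juicer: `summarize_jsonl` is a hypothetical helper, and it assumes only that each non-empty line holds one JSON object. It counts the samples and tallies the top-level field names it sees.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path, limit=None):
    """Count records in a jsonl file and tally the top-level keys seen.

    Hypothetical helper for illustration; assumes one JSON object per line.
    """
    n_records = 0
    key_counts = Counter()
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            record = json.loads(line)  # raises on a malformed line
            n_records += 1
            key_counts.update(record.keys())
            if limit is not None and n_records >= limit:
                break
    return n_records, dict(key_counts)
```

Calling it on any file under <target_dir>, optionally with a small `limit`, gives a quick view of the sample count and fields without loading the whole file.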

After conversion, modify the path configurations in redpajama-arxiv.yaml and execute the following command to reproduce the processing flow of RedPajama:

python tools/process_data.py --config <path-to-data-juicer-hub>/reproduced_redpajama/redpajama-arxiv.yaml

Comparison

	num_samples	num_tokens	peak_memory	wall_time
RedPajama	1,724,497	30,667,506,934	35GB	`total: 11h52min`
Data-Juicer	2,675,426	30,338,153,178	21GB	preprocess: 5h21min, read+unify: 25min, remove_header_mapper: 5min, remove_comments_mapper: 3min, remove_bibliography_mapper: 4min, expand_macro_mapper: 5min19s, text_length_filter: 4min, export: 43min, `total: 6h53min`
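To read the table at a glance, the reported totals can be turned into relative differences. The snippet below is a small worked calculation that uses only figures from the table above; the helpers `to_minutes` and `relative_change` are hypothetical names introduced here for illustration.

```python
import re


def to_minutes(duration):
    """Convert a string such as '11h52min' or '5min19s' to minutes."""
    units = {"h": 60.0, "min": 1.0, "s": 1.0 / 60.0}
    parts = re.findall(r"(\d+(?:\.\d+)?)(h|min|s)", duration)
    if not parts:
        raise ValueError(f"unrecognized duration: {duration!r}")
    return sum(float(value) * units[unit] for value, unit in parts)


def relative_change(baseline, new):
    """Signed fractional change of `new` relative to `baseline`."""
    return (new - baseline) / baseline


# Figures copied from the arXiv comparison table.
redpajama = {"tokens": 30_667_506_934, "mem_gb": 35, "wall": "11h52min"}
data_juicer = {"tokens": 30_338_153_178, "mem_gb": 21, "wall": "6h53min"}

token_change = relative_change(redpajama["tokens"], data_juicer["tokens"])
mem_change = relative_change(redpajama["mem_gb"], data_juicer["mem_gb"])
time_change = relative_change(to_minutes(redpajama["wall"]),
                              to_minutes(data_juicer["wall"]))

print(f"tokens: {token_change:+.1%}, memory: {mem_change:+.1%}, "
      f"wall time: {time_change:+.1%}")
```

With these figures, the Data-Juicer run reports about 1% fewer tokens, 40% lower peak memory, and 42% shorter wall time than the RedPajama run.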

Books

The raw data files can be downloaded from the same HuggingFace datasets as in Redpajama/Books.

Once downloaded, modify the path configurations in redpajama-books.yaml and execute the following command to reproduce the processing flow of RedPajama:

python tools/process_data.py --config <path-to-data-juicer-hub>/reproduced_redpajama/redpajama-books.yaml

Comparison

	num_samples	num_tokens	peak_memory	wall_time
RedPajama	205,183	25,962,395,123	450GB	split_for_dedup: 5min, dedup: 117min, `total: 122min`
Data-Juicer	207,902	26,108,635,683	96GB	read+unify: 20min, compute_hash: 78min, dedup: 3min, export: 3min, `total: 114min`

Code

The raw data files can be downloaded from Google BigQuery as in Redpajama/Code.

Once downloaded, unzip the archives and delete all files whose extensions are not in the following whitelist:

.asm, .bat, .cmd, .c, .h, .cs, .cpp, .hpp, .c++, .h++, .cc, .hh, .C, .H, .cmake, .css, .dockerfile, .f90, .f, .f03, .f08, .f77, .f95, .for, .fpp, .go, .hs, .html, .java, .js, .jl, .lua, .md, .markdown, .php, .php3, .php4, .php5, .phps, .phpt, .pl, .pm, .pod, .perl, .ps1, .psd1, .psm1, .py, .rb, .rs, .sql, .scala, .sh, .bash, .command, .zsh, .ts, .tsx, .tex, .vb, Dockerfile, Makefile, .xml, .rst, .m, .smali
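The whitelist step can be scripted. The following is a minimal sketch, not RedPajama's or Data-Juicer's own tooling: `is_whitelisted` and `prune` are hypothetical helpers, and the sketch assumes that matching on the final file suffix (case-sensitively, since the list distinguishes .c from .C) plus the two bare file names is what is intended. It defaults to a dry run so the rejected files can be reviewed before anything is deleted.

```python
from pathlib import Path

# Suffixes from the whitelist, compared case-sensitively.
ALLOWED_SUFFIXES = set("""
.asm .bat .cmd .c .h .cs .cpp .hpp .c++ .h++ .cc .hh .C .H .cmake .css
.dockerfile .f90 .f .f03 .f08 .f77 .f95 .for .fpp .go .hs .html .java .js
.jl .lua .md .markdown .php .php3 .php4 .php5 .phps .phpt .pl .pm .pod
.perl .ps1 .psd1 .psm1 .py .rb .rs .sql .scala .sh .bash .command .zsh
.ts .tsx .tex .vb .xml .rst .m .smali
""".split())

# Whitelist entries that are whole file names rather than extensions.
ALLOWED_NAMES = {"Dockerfile", "Makefile"}


def is_whitelisted(path):
    """True if the file name or its final suffix is on the whitelist."""
    p = Path(path)
    return p.name in ALLOWED_NAMES or p.suffix in ALLOWED_SUFFIXES


def prune(root, dry_run=True):
    """Return files under `root` that are not whitelisted.

    The files are deleted only when dry_run is False.
    """
    rejected = [p for p in Path(root).rglob("*")
                if p.is_file() and not is_whitelisted(p)]
    if not dry_run:
        for p in rejected:
            p.unlink()
    return rejected
```

A typical use would be to call `prune(<unzipped_dir>)` first, inspect the returned list, and then call it again with `dry_run=False`.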

After preparation, modify the path configurations in redpajama-code.yaml and execute the following command to reproduce the processing flow of RedPajama:

python tools/process_data.py --config <path-to-data-juicer-hub>/reproduced_redpajama/redpajama-code.yaml

Comparison

	num_samples	num_tokens	peak_memory	wall_time
RedPajama	73,208,524	150,390,270,060	212GB	local-dedup: 37h, global-dedup: 1h, merge-dedup: 6h, filter: 17h, `total: 61h`
Data-Juicer	73,169,889	150,310,903,230	370GB	preprocess: 5h21min, read+unify: 12h, document_deduplicator: 20h, clean_copyright_mapper: 3h, maximum_line_length_filter: 2.5h, average_line_length_filter: 2h, alphanumeric_filter: 13h, export: 2.5h, `total: 59h`

StackExchange

The raw data files can be downloaded from the same Archive link as in Redpajama/Stack_exchange.

Once downloaded, use raw_stackexchange_to_jsonl.py to convert from the original format to jsonl that Data-Juicer can handle easily:

python tools/preprocess/raw_stackexchange_to_jsonl.py           \
    --src_dir       <src_dir>      \
    --target_dir    <target_dir>   \
    --topk          <topk>         \
    --num_proc      <num_proc>

After conversion, modify the path configurations in redpajama-stackexchange.yaml and execute the following command to reproduce the processing flow of RedPajama:

python tools/process_data.py --config <path-to-data-juicer-hub>/reproduced_redpajama/redpajama-stackexchange.yaml
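All four subsets are processed with the same command and differ only in the recipe file, so the runs can be driven from one place. The sketch below is an illustration rather than part of Data-Juicer: `build_commands` and `run_all` are hypothetical helpers, and the sketch assumes it is run from a Data-Juicer checkout where tools/process_data.py exists and that the path configurations in each recipe have already been edited.

```python
import subprocess
import sys
from pathlib import Path

# Recipe name suffixes used in this document.
SUBSETS = ("arxiv", "books", "code", "stackexchange")


def build_commands(hub_dir, subsets=SUBSETS):
    """Build one process_data.py command per RedPajama subset recipe."""
    recipe_dir = Path(hub_dir) / "reproduced_redpajama"
    return [
        [sys.executable, "tools/process_data.py",
         "--config", str(recipe_dir / f"redpajama-{name}.yaml")]
        for name in subsets
    ]


def run_all(hub_dir, subsets=SUBSETS):
    """Run the recipes one after another, stopping at the first failure."""
    for cmd in build_commands(hub_dir, subsets):
        subprocess.run(cmd, check=True)
```

Running the recipes sequentially keeps peak memory bounded by the largest single run, which matters given the memory figures in the comparison tables.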

Comparison

	num_samples	num_tokens	peak_memory	wall_time
RedPajama	29,825,086	20,502,757,123	>500GB	filter: 170min, postprocess: 90min, `total: 260min`
Data-Juicer	29,825,086	20,628,082,262	100GB	preprocess: 210min, read+unify: 86min, clean_html: 15min, language_id_score_filter: 18min, `total: 391min`