BLOOM Config Files

This folder in Data-Juicer-Hub contains example configuration files for reproducing the processing flow of the ROOTS dataset, which the BigScience initiative created to train the BLOOM models.

Oscar

Download the raw data files as described in BLOOM/Oscar, then use bloom-oscar.yaml to run the full processing pipeline.
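As a rough illustration of the step above, the following sketch launches the processing from Python. It assumes that Data-Juicer is installed and exposes its `dj-process` command-line entry point with a `--config` option, and that bloom-oscar.yaml sits in the current directory with its dataset path already pointing at the downloaded raw files. The helper names `build_command` and `run_processing` are hypothetical and exist only for this example.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_command(config_path: str) -> List[str]:
    """Build the argv used to run Data-Juicer on a config file.

    Assumption: the installed Data-Juicer provides the `dj-process`
    entry point and accepts `--config <path>`.
    """
    return ["dj-process", "--config", str(Path(config_path))]


def run_processing(config_path: str = "bloom-oscar.yaml") -> int:
    """Run the processing and return the exit code of the tool."""
    # Fail early with a clear message if the config file is missing.
    if not Path(config_path).is_file():
        raise FileNotFoundError(f"Config file not found: {config_path}")
    # Fail early if the (assumed) entry point is not on PATH.
    if shutil.which("dj-process") is None:
        raise RuntimeError("dj-process not found; is Data-Juicer installed?")
    return subprocess.run(build_command(config_path), check=False).returncode
```

Calling `run_processing()` from the directory that holds bloom-oscar.yaml would start the run; a non-zero return value indicates that the tool reported an error.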

An analysis of our reproduction will be published soon.