BLOOM Config Files

This folder in Data-Juicer-Hub contains example configuration files for quickly reproducing the processing flow of the ROOTS dataset, which the BigScience initiative created to train the BLOOM models.

Oscar

Download the raw data files as described in BLOOM/Oscar, then run the full processing pipeline with bloom-oscar.yaml.
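As a rough illustration of that last step, the sketch below launches the recipe from Python. It assumes Data-Juicer is installed so that its `dj-process` command is on the PATH, and that bloom-oscar.yaml sits in the current directory with its dataset path already pointing at the downloaded raw files. The helper names `build_command` and `run_recipe` are hypothetical and only serve this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(config_path):
    """Return the argument list that runs Data-Juicer on one config file."""
    return ["dj-process", "--config", str(config_path)]


def run_recipe(config_path):
    """Run the recipe, failing early if the config or the CLI is missing."""
    config = Path(config_path)
    if not config.is_file():
        raise FileNotFoundError(f"config not found: {config}")
    if shutil.which("dj-process") is None:
        raise RuntimeError("dj-process is not on PATH; is Data-Juicer installed?")
    # check=True raises CalledProcessError if the run exits with a non-zero status.
    subprocess.run(build_command(config), check=True)
```

Calling `run_recipe("bloom-oscar.yaml")` is then equivalent to running `dj-process --config bloom-oscar.yaml` in a shell. Adjust the path if the config file lives elsewhere.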

An analysis of our reproduction will be published soon.