BLOOM Config Files
This folder in Data-Juicer-Hub contains example configuration files for quickly reproducing the processing flow of the ROOTS dataset, which the BigScience initiative created to train the BLOOM models.
Oscar
Download the raw data files as described in BLOOM/Oscar, then use bloom-oscar.yaml to run the full processing pipeline.
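As a rough illustration of the step above, the following sketch launches a processing run with the config file from Python. It assumes Data-Juicer is installed and exposes its `dj-process` command-line entry point with a `--config` option, and that bloom-oscar.yaml sits in the current directory with its dataset path already pointing at the downloaded raw data. The helper names `build_command` and `run_processing` are hypothetical and exist only for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(config_path, executable="dj-process"):
    """Return the argument list for a Data-Juicer processing run.

    Assumption: the Data-Juicer CLI is invoked as `dj-process --config <file>`.
    """
    return [executable, "--config", str(Path(config_path))]


def run_processing(config_path):
    """Run the processing if the CLI is on PATH; otherwise report and return None."""
    cmd = build_command(config_path)
    if shutil.which(cmd[0]) is None:
        # Data-Juicer is not installed in this environment; show the command instead.
        print("CLI not found on PATH; would run:", " ".join(cmd))
        return None
    # check=True raises CalledProcessError if the processing run fails.
    return subprocess.run(cmd, check=True)
```

Calling `run_processing("bloom-oscar.yaml")` would then start the run, or print the command it would have executed if the CLI is not available. Adjust the path if the config file lives elsewhere.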
An analysis of our reproduction will be published soon.