Data-Juicer-Hub

Community-driven data-juicer recipes and best practices for various pre-training/fine-tuning tasks.

Documentation

Detailed documentation about the recipes can be found here.

Quick Start

There are plenty of prepared recipes for data processing on different tasks. You can use them by cloning this repo and setting `--config` to the local path of the target recipe file:

# clone this repo to somewhere on your local machine
git clone https://github.com/datajuicer/data-juicer-hub.git
# run with the actual local path to the target recipe
dj-process --config <root-of-data-juicer-hub>/demo/process.yaml --dataset_path <your-dataset-path>
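If you switch between recipes often, a small wrapper can save retyping the clone path. The sketch below is only an illustration: `dj_recipe_cmd` is a hypothetical helper name, not part of data-juicer, and it assumes the clone lives at `$HOME/data-juicer-hub`. It only prints the command, using the same `--config` and `--dataset_path` options shown above, so you can check the resolved recipe path before running anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of data-juicer): build the dj-process command
# for a recipe inside a local clone of this repo and print it.
#   $1: root of the local data-juicer-hub clone
#   $2: recipe path relative to that root, e.g. demo/process.yaml
#   $3: path to your dataset
dj_recipe_cmd() {
    hub_root=${1%/}   # drop a single trailing slash, if present
    recipe=$2
    dataset=$3
    printf 'dj-process --config %s --dataset_path %s\n' "$hub_root/$recipe" "$dataset"
}

# Example call; the clone location and dataset name are assumptions.
dj_recipe_cmd "$HOME/data-juicer-hub" demo/process.yaml ./my_dataset.jsonl
```

The output is a plain string with no extra quoting, so paths containing spaces need to be quoted by hand before the command is run.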

Contributing

This is a community-driven repo, so feel free to upload your own recipes to this repo! 😄