Data-Juicer-Hub
Community-driven data-juicer recipes and best practices for various pre-training/fine-tuning tasks.
Documentation
Detailed documentation about the recipes can be found here.
Quick Start
Many prepared recipes are available for data processing on different tasks. To use one, clone this repo and set `--config` to the local path of the target recipe file:
# clone this repo to somewhere on your local machine
git clone https://github.com/datajuicer/data-juicer-hub.git
# run with the actual local path to the target recipe
dj-process --config <root-of-data-juicer-hub>/demo/process.yaml --dataset_path <your-dataset-path>
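To choose a value for `--config`, it can help to see which recipe files a local checkout contains. The sketch below is not part of the repo: `list_recipes` and `DJ_HUB_ROOT` are hypothetical names introduced here, and it assumes recipes are stored as `.yaml` or `.yml` files.

```shell
# Hypothetical helper (not provided by the repo): list YAML files under a
# local checkout of data-juicer-hub so one can be passed to --config.
# The directory comes from the first argument, else from the assumed
# DJ_HUB_ROOT variable, else the current directory.
list_recipes() {
    hub_root="${1:-${DJ_HUB_ROOT:-.}}"
    # Assumption: recipe files use a .yaml or .yml extension.
    find "$hub_root" -type f \( -name '*.yaml' -o -name '*.yml' \) | sort
}
```

For example, `list_recipes <root-of-data-juicer-hub>` prints one path per line, and any of those paths can then be given to `dj-process --config`.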
If you prefer learning and using Data-Juicer through interactive Notebooks, you can switch to the notebook branch:
# Switch to the notebook branch
git checkout notebook
This branch contains detailed Data-Juicer Notebook tutorials. You can refer to the online documentation for usage guidance.
Contributing
This is a community-driven repo, so feel free to contribute your own recipes!