Notebooks Tutorial
Welcome to the Notebooks Tutorial Library of Data Juicer Hub! This branch contains a series of Jupyter notebooks to help you get started with Data Juicer quickly.
Quick Start
In addition to the methods below, you can also try the JupyterLab Playground with Tutorials.
Method 1: GitHub Codespaces
Launch Codespace
Click the Code button in this repository and select the Codespaces tab
Click Create codespace on notebook (the + icon) to start the environment
Wait a moment and you'll see the VS Code Web interface
Select and Run Notebooks
Find the notebooks folder in the left file directory
Click on the notebook file you're interested in
Click the kernel selector in the top right corner and choose the data-juicer-hub environment (located in the .venv directory)
Start running!
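To confirm that the notebook is really using the selected kernel, a first cell along the following lines can help. This is a minimal sketch, assuming the data-juicer-hub environment is a standard Python virtual environment, so that `sys.prefix` points inside the .venv directory; the helper name `runs_from_venv` is hypothetical and not part of Data Juicer.

```python
import sys
from pathlib import Path


def runs_from_venv(prefix: str, venv_dir_name: str = ".venv") -> bool:
    """Return True if the given interpreter prefix lies inside a directory
    named `venv_dir_name` (hypothetical helper, for illustration only)."""
    return venv_dir_name in Path(prefix).parts


# sys.executable is the interpreter running this kernel; sys.prefix is the
# root of the active environment (assumed here to be a standard venv).
print("Interpreter:", sys.executable)
print("Running from .venv:", runs_from_venv(sys.prefix))
```

If the second line prints `False`, the notebook is probably attached to a different interpreter, and re-selecting the kernel in the top right corner is worth trying.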
Method 2: Google Colab
Click the links below to run the tutorials online in Google Colab:
Tutorial Content Overview
01_Getting_Started - Data Juicer core concepts and quick start guide
02_Building_Recipes - Design and construction of data processing recipes
03_Data_Formats_and_Loading - Comprehensive guide to data formats and loading methods
04_DJ_Dataset_API - Complete Dataset API guide
05_Operators_Usage - Detailed explanation of data processing operators (both YAML and Python modes)
06_Analysis_and_Visualization - Data analysis and visualization tools
07_Distributed_Processing_with_Ray - Ray distributed processing framework integration
08_Preprocessing - Data preprocessing scripts
09_Multimodal_Data_Processing - Multimodal data processing capabilities
10_Advanced_Dataset_Configuration - Advanced Dataset configuration options
Recommended Learning Paths
Quick Start
01 → 02 → 05 → 06 → 07
For: Learners who want to quickly understand and use Data Juicer's core capabilities
Data Integration
03 → 08, 09, 10
For: Developers who already understand Data Juicer's core capabilities and need to integrate their own data into processing pipelines
Programming and Customization
01 → 03 → 04 → 05 → 07
For: Engineers who want to flexibly customize data processing pipelines with code
More Resources
Happy Learning!