Refine Alpaca-CoT Config Files#

This folder contains configuration files for quickly refining the Alpaca-CoT dataset.

Preprocess#

The raw data files can be downloaded from Alpaca-CoT on HuggingFace.

Convert raw Alpaca-CoT data to jsonl#

Use raw_alpaca_cot_merge_add_meta.py to select the instruction, input, and output columns, merge them into a single text field separated by spaces, and add extra meta info (see Meta Info below) to the dataset:

python tools/preprocess/raw_alpaca_cot_merge_add_meta.py    \
    --src_dir             <Alpaca-CoT_src_dir>              \
    --target_dir          <target_dir>                      \
    --num_proc            <num_proc>
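
The merge step described above can be pictured with a minimal sketch. It is not the code of raw_alpaca_cot_merge_add_meta.py: the function name merge_sample, the "meta" key, the meta keys, and the choice to skip empty columns are all assumptions made for illustration.

```python
import json

def merge_sample(raw, meta):
    """Join instruction, input, and output with single spaces and attach meta.

    Hypothetical helper: empty or missing columns are skipped so the text
    never contains doubled spaces (an assumption, not confirmed behavior).
    """
    parts = [raw.get(key, "") for key in ("instruction", "input", "output")]
    text = " ".join(part.strip() for part in parts if part and part.strip())
    return {"text": text, "meta": dict(meta)}

# Illustrative record and meta values (key names are assumed).
raw = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
meta = {"Dataset": "alpaca", "Task": "MT", "Gen": "SI", "Lang": "EN"}

# One jsonl line per sample.
line = json.dumps(merge_sample(raw, meta), ensure_ascii=False)
print(line)
```

Each merged sample becomes one line of the output jsonl file, which is the format the later steps consume.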

Split datasets into sub-datasets by language#

Use dataset_split_by_language.py to split the dataset into EN and ZH sub-datasets:

python tools/preprocess/dataset_split_by_language.py    \
    --src_dir             <src_dir>                     \
    --target_dir          <target_dir>                  \
    --suffixes            jsonl                         \
    --num_proc            <num_proc>
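
Conceptually, the split routes each sample into a per-language bucket. The sketch below shows only that routing. How dataset_split_by_language.py actually detects language is not described here, so guess_lang is a hypothetical stand-in based on the share of CJK ideographs, and the 0.3 threshold is an arbitrary assumption.

```python
from collections import defaultdict

def guess_lang(text, threshold=0.3):
    """Hypothetical stand-in detector: ZH if enough CJK ideographs, else EN."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return "EN"
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return "ZH" if cjk / len(chars) >= threshold else "EN"

def split_by_language(samples, detect=guess_lang):
    """Group samples by the language label returned by `detect`."""
    buckets = defaultdict(list)
    for sample in samples:
        buckets[detect(sample["text"])].append(sample)
    return dict(buckets)

samples = [
    {"text": "Explain gradient descent."},
    {"text": "请解释梯度下降。"},
]
buckets = split_by_language(samples)
for lang, items in buckets.items():
    print(lang, len(items))
```

Passing the detector as a parameter keeps the routing logic independent of whichever language-identification method is actually used.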

Process#

After preprocessing, modify the dataset path in alpaca-cot-en-refine.yaml and alpaca-cot-zh-refine.yaml, then run the following commands to reproduce the processing flow of refined Alpaca-CoT.

# refine English dataset
python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml

# refine Chinese dataset
python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alpaca-cot-zh-refine.yaml

Meta Info#

Each sample in the refined Alpaca-CoT data contains the meta info listed below:

Alpaca-CoT original meta info#

  • Language Tags:

    • EN: Instruction datasets in English

    • CN: Instruction datasets in Chinese

    • ML: [Multi-lingual] Instruction datasets in multiple languages

  • Task Tags:

    • MT: [Multi-task] Datasets containing multiple tasks

    • TS: [Task-specific] Datasets tailored for specific tasks

  • Generation-method:

    • HG: [Human Generated Dataset] Datasets created by humans

    • SI: [Self-Instruct] Datasets generated using self-instruct methods

    • MIX: [Mixed Dataset] Datasets containing both human- and machine-generated data

    • COL: [Collection of Dataset] Datasets made from a collection of other datasets

Data-Juicer Meta info#

  • Dataset: dataset name in Alpaca-CoT

  • origin_path: original file path in Alpaca-CoT

  • IFT: tagged as Instruct Fine-Tuning datasets

  • CFT: tagged as Chat Fine-Tuning datasets

    • CFT-SR: tagged as Single-round Dialog datasets

    • CFT-MR: tagged as Multi-round Dialog datasets

    • CFT-P: tagged as Preference datasets
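
These tags make it possible to pick out subsets of the refined data. The following is a minimal sketch, assuming each sample stores its tags as boolean values under a "meta" key. That layout, the helper names, and the demo dataset names are assumptions for illustration, not the documented schema.

```python
def has_tag(sample, tag):
    """Return True if the sample's meta marks `tag` (e.g. "IFT") as set."""
    return bool(sample.get("meta", {}).get(tag, False))

def select_by_tag(samples, tag):
    """Keep only the samples carrying the given tag."""
    return [s for s in samples if has_tag(s, tag)]

# Made-up samples; dataset names are placeholders, not real Alpaca-CoT entries.
samples = [
    {"text": "a", "meta": {"Dataset": "demo_a", "IFT": True}},
    {"text": "b", "meta": {"Dataset": "demo_b", "CFT-SR": True, "CFT-P": True}},
    {"text": "c", "meta": {"Dataset": "demo_c", "CFT-MR": True}},
]

ift_only = select_by_tag(samples, "IFT")
preference = select_by_tag(samples, "CFT-P")
print([s["meta"]["Dataset"] for s in ift_only])
print([s["meta"]["Dataset"] for s in preference])
```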

Refined Alpaca-CoT dataset Meta info#

| | Task | Gen | Lang | Dataset | IFT, CFT-SR, CFT-MR, CFT-P (one ✅ per tag set) |
|---|---|---|---|---|---|
| Chain-of-Thought | MT | HG | EN/CN | Chain-of-Thought | ✅ |
| GPT4all | MT | COL | EN | GPT4all | ✅ ✅ |
| GPTeacher | MT | SI | EN | GPTeacher | ✅ |
| Guanaco | MT | SI | ML | Guanaco | ✅ |
| HC3 | TS | MIX | EN/CN | HC3 | ✅ ✅ |
| alpaca | MT | SI | EN | alpaca | ✅ |
| Natural-Instructions | MT | COL | ML | Natural-Instructions | ✅ |
| belle_cn | TS/MT | SI | CN | belle_cn | ✅ |
| instinwild | MT | SI | EN/CN | instinwild | ✅ |
| prosocial-dialog | TS | MIX | EN | prosocial-dialog | ✅ |
| finance | TS | COL | EN | finance | ✅ |
| xP3 | MT | COL | ML | xP3 | ✅ |
| firefly | MT | COL | CN | firefly | ✅ |
| instruct | MT | COL | EN | instruct | ✅ |
| CodeAlpaca | TS | SI | EN | CodeAlpaca | ✅ |
| alpacaGPT4 | MT | SI | EN/CN | alpacaGPT4 | ✅ ✅ |
| webGPT | TS | MIX | EN | webGPT | ✅ ✅ |
| dolly | TS | HG | EN | dolly | ✅ |
| baize | MT | COL | EN | baize | ✅ |
| hh-rlhf | TS | MIX | EN | hh-rlhf | ✅ ✅ ✅ |
| OIG | MT | COL | EN | OIG | ✅ |
| GAOKAO | MT | COL | CN | GAOKAO | ✅ |
| camel | MT | SI | EN | camel | ✅ |
| FLAN-Muffin | MT | COL | EN | FLAN-Muffin | ✅ |
| COIG | MT | COL | CN | COIG | ✅ |
| gpt4tools | MT | SI | EN | gpt4tools | ✅ |
| ShareGPT | MT | MIX | EN | ShareGPT | ✅ ✅ |
| Auto-CoT | MT | COL | EN | Auto-CoT | ✅ |
| MOSS | TS | SI | EN/CN | MOSS | ✅ |
| ultrachat | TS | SI | EN | ultrachat | ✅ |
| Chinese-medical | TS | COL | CN | Chinese-medical | ✅ |
| CSL | MT | COL | CN | CSL | ✅ |
| pCLUE | MT | COL | CN | pCLUE | ✅ |
| news_commentary | TS | COL | CN | news_commentary | ✅ |
| StackExchange | MT | COL | EN | StackExchange | ✅ ✅ |
| ConvAI2 | TS | HG | EN | ConvAI2 | ✅ |
| FastChat | MT | SI | EN | FastChat | ✅ |
| Tabular-LLM-Data | MT | COL | EN/CN | Tabular-LLM-Data | ✅ |