Auto Evaluation Toolkit#

Automatically evaluate your model and monitor metric changes during the training process.

Preparation#

  1. Multiple GPU machines (at least 2: one for evaluation, the rest for training).

  2. Mount a shared file system (e.g., NAS) to the same path (e.g., /mnt/shared) on the above machines.

  3. Install Data-Juicer in the shared file system (e.g., /mnt/shared/code/data-juicer).

  4. Install third-party dependencies (Megatron-LM and HELM) according to thirdparty/README.md on each machine.

  5. Prepare your dataset and tokenizer, and preprocess the dataset with Megatron-LM into mmap format in the shared file system (e.g., /mnt/shared/dataset); see the Megatron-LM README for details and the sketch after this list for an illustrative layout.

  6. Run Megatron-LM on training machines and save the checkpoint in the shared file system (e.g., /mnt/shared/checkpoints).
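
For orientation, the sketch below shows one possible layout of the shared file system together with a typical Megatron-LM preprocessing call. All paths, file names, and parameter values are illustrative placeholders, and the exact preprocessing flags differ between Megatron-LM versions (e.g., --vocab vs. --vocab-file, and whether --dataset-impl exists); consult the README and `python tools/preprocess_data.py --help` of your Megatron-LM installation for the authoritative invocation.

# Example layout of the shared file system (illustrative)
/mnt/shared/
  code/data-juicer/     # Data-Juicer installation (step 3)
  dataset/              # tokenizer files and preprocessed mmap dataset (step 5)
  checkpoints/          # checkpoints saved by Megatron-LM during training (step 6)
  cache/                # cache dir used by the evaluator

# Illustrative preprocessing call, run from the Megatron-LM root
# (flag names may vary across Megatron-LM versions):
python tools/preprocess_data.py \
    --input /mnt/shared/dataset/corpus.jsonl \
    --output-prefix /mnt/shared/dataset/corpus \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file /mnt/shared/dataset/gpt2-vocab.json \
    --merge-file /mnt/shared/dataset/gpt2-merges.txt \
    --append-eod \
    --workers 8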

Usage#

Use evaluator.py to automatically evaluate your models with HELM or the OpenAI API.

python tools/evaluator.py  \
    --config <config>      \
    --begin-iteration     <begin_iteration>     \
    [--end-iteration      <end_iteration>]      \
    [--iteration-interval <iteration_interval>] \
    [--check-interval <check_interval>]         \
    [--model-type     <model_type>]             \
    [--eval-type      <eval_type>]
  • config: a YAML file containing the settings required to run the evaluation (see Configuration for details)

  • begin_iteration: iteration of the first checkpoint to be evaluated

  • end_iteration: iteration of the last checkpoint to be evaluated. If not set, continuously monitor the training process and evaluate the generated checkpoints.

  • iteration_interval: the iteration interval between two evaluated checkpoints; default is 1000 iterations

  • check_interval: the time interval (in minutes) between two checks for new checkpoints; default is 30 minutes

  • model_type: the type of your model; megatron and huggingface are supported for now

    • megatron: evaluate Megatron-LM checkpoints (default)

    • huggingface: evaluate a HuggingFace model; only the gpt eval type is supported

  • eval_type: the type of evaluation to run; helm and gpt are supported for now

    • helm: evaluate your model with HELM (default); you can change the benchmarks to run by modifying the HELM spec template file (helm_spec_template_path in Configuration)

    • gpt: evaluate your model with the OpenAI API; more details can be found in gpt_eval/README.md.

e.g.,

python tools/evaluator.py --config <config_file> --begin-iteration 2000 --iteration-interval 1000 --check-interval 10

will use HELM to evaluate a Megatron-LM checkpoint every 1000 iterations starting from iteration 2000, checking every 10 minutes whether a new checkpoint that meets the condition has been saved.
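
To stop after a fixed iteration and use the OpenAI-based evaluator instead, a run might look like the following (the config file and iteration values are placeholders):

python tools/evaluator.py    \
    --config <config_file>   \
    --eval-type gpt          \
    --begin-iteration 2000   \
    --end-iteration 10000    \
    --iteration-interval 2000

Here the evaluator exits once the checkpoint at iteration 10000 has been evaluated, instead of monitoring the training process indefinitely.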

After running evaluator.py, you can use recorder/wandb_writer.py to visualize the evaluation results; more details can be found in recorder/README.md.

Configuration#

The format of the config file is as follows:

auto_eval:
  project_name: <str> # your project name
  model_name: <str>   # your model name
  cache_dir: <str>    # path of cache dir
  megatron:
    process_num: <int>     # number of processes used to run Megatron-LM
    megatron_home: <str>   # root dir of Megatron-LM
    checkpoint_path: <str> # path of checkpoint dir
    tokenizer_type: <str>  # supports gpt2 or sentencepiece for now
    vocab_path: <str>      # configuration for gpt2 tokenizer type, path to vocab file
    merge_path: <str>      # configuration for gpt2 tokenizer type, path to merge file
    tokenizer_path: <str>  # configuration for sentencepiece tokenizer type, path to model file
    max_tokens: <int>      # max tokens to generate in inference
    token_per_iteration: <float> # billions of tokens consumed per training iteration
  helm:
    helm_spec_template_path: <str> # path of helm spec template file, default is tools/evaluator/config/helm_spec_template.conf
    helm_output_path: <str>  # path of helm output dir
    helm_env_name: <str>     # helm conda env name
  gpt_evaluation:
    # openai config
    openai_api_key: <str>       # your api key
    openai_organization: <str>  # your organization
    # files config
    question_file: <str>  # default is tools/evaluator/gpt_eval/config/question.jsonl
    baseline_file: <str>  # default is tools/evaluator/gpt_eval/answer/openai/gpt-3.5-turbo.jsonl
    prompt_file: <str>    # default is tools/evaluator/gpt_eval/config/prompt.jsonl
    reviewer_file: <str>  # default is tools/evaluator/gpt_eval/config/reviewer.jsonl
    answer_file: <str>    # path to generated answer file
    result_file: <str>    # path to generated review file
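
For reference, a filled-in config for a Megatron-LM model with a GPT-2 tokenizer might look like the sketch below. Every value (project and model names, paths, process_num, max_tokens, token_per_iteration, etc.) is an illustrative placeholder to be replaced with your own setup.

auto_eval:
  project_name: demo-project
  model_name: demo-1b3
  cache_dir: /mnt/shared/cache
  megatron:
    process_num: 8
    megatron_home: /mnt/shared/code/Megatron-LM
    checkpoint_path: /mnt/shared/checkpoints/demo-1b3
    tokenizer_type: gpt2
    vocab_path: /mnt/shared/dataset/gpt2-vocab.json
    merge_path: /mnt/shared/dataset/gpt2-merges.txt
    max_tokens: 512
    token_per_iteration: 0.0021   # e.g., global_batch_size * seq_length / 1e9
  helm:
    helm_spec_template_path: tools/evaluator/config/helm_spec_template.conf
    helm_output_path: /mnt/shared/helm_output
    helm_env_name: helm
  gpt_evaluation:
    openai_api_key: <your_api_key>
    openai_organization: <your_organization>
    question_file: tools/evaluator/gpt_eval/config/question.jsonl
    baseline_file: tools/evaluator/gpt_eval/answer/openai/gpt-3.5-turbo.jsonl
    prompt_file: tools/evaluator/gpt_eval/config/prompt.jsonl
    reviewer_file: tools/evaluator/gpt_eval/config/reviewer.jsonl
    answer_file: /mnt/shared/gpt_eval/answer/demo-1b3.jsonl
    result_file: /mnt/shared/gpt_eval/result/demo-1b3.jsonl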