Evaluation Results Recorder#

Record your evaluation results to Weights & Biases (wandb) with wandb_writer.py.

With wandb_writer.py, you can:

  • visualize how the evaluation metrics of your model change during training

  • build a leaderboard to compare the metrics of different models

Usage#

python wandb_writer.py --config <config_file> [--print-only]
  • config_file: path to the configuration file (see Configuration for details)

  • --print-only: only print the results to the command line; do not write to wandb
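
For example, to preview the results without writing to wandb (the config file name below is illustrative):

python wandb_writer.py --config config/helm_output.yaml --print-only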

Configuration#

We provide three example files in the config folder, one for each of the three cases below.

The general format is as follows:

project: <str>   # your wandb project name
base_url: <str>  # the URL of your wandb instance
# other specific configuration items
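
For instance, a filled-in general section might look like this (the project name is a placeholder; api.wandb.ai is the public wandb instance):

project: my-llm-evals            # hypothetical project name
base_url: https://api.wandb.ai   # replace with your own URL if self-hosted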

Parse from HELM output#

The following configuration parses evaluation results from a HELM output folder and records them to wandb.

# general configurations
# ...

evals:  # evaluations to record
  - eval_type: helm    # only helm is supported for now
    model_name: <str>  # your model name
    source: helm       # parse results from a HELM output directory
    helm_output_dir: <your helm output dir path>
    helm_suite_name: <your helm suite name>
    token_per_iteration: <tokens per iteration in billions>
    benchmarks:  # benchmark metrics to be recorded, and below are some examples
      - name: mmlu
        metrics:
          - EM
      - name: boolq
        metrics:
          - EM
      - name: narrative_qa
        metrics:
          - F1
      - name: hellaswag
        metrics:
          - EM
      - ...

If the benchmarks field is not provided, we use the 16 core metrics of HELM as the default benchmarks:

mmlu.EM, raft.EM, imdb.EM, truthful_qa.EM, summarization_cnndm.ROUGE-2, summarization_xsum.ROUGE-2, boolq.EM, quac.F1, msmarco_trec.NDCG@10, msmarco_regular.RR@10, narrative_qa.F1, natural_qa_closedbook.F1, natural_qa_openbook_longans.F1, civil_comments.EM, hellaswag.EM, openbookqa.EM
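
For example, a minimal configuration for this case that relies on the default benchmarks might look like the following (the project name, output path, and suite name are placeholders):

project: my-llm-evals
base_url: https://api.wandb.ai

evals:
  - eval_type: helm
    model_name: llama-7B
    source: helm
    helm_output_dir: /path/to/helm/output
    helm_suite_name: my-suite
    token_per_iteration: 0.5  # billions of tokens per training iteration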

Parse from configuration file#

Metric scores can also be given directly in the configuration file, as in the following example.

# general configurations
# ...

evals:  # evaluations to record
  - eval_type: helm
    model_name: llama-7B  # your model name
    source: file  # read the scores directly from this configuration file
    token_num: 1000
    eval_result:  # evaluation results to be recorded
      mmlu:
        EM: 0.345
      boolq:
        EM: 0.751
      narrative_qa:
        F1: 0.524
      hellaswag:
        EM: 0.747
      ...
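
Putting it together with the general section, a complete file-based configuration might look like this (the project name, URL, and scores are illustrative):

project: my-llm-evals
base_url: https://api.wandb.ai

evals:
  - eval_type: helm
    model_name: llama-7B
    source: file
    token_num: 1000
    eval_result:
      mmlu:
        EM: 0.345
      boolq:
        EM: 0.751

Saved as, say, config/file_source.yaml, it can be previewed with python wandb_writer.py --config config/file_source.yaml --print-only before writing to wandb.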

Make a leaderboard#

The following configuration produces a leaderboard.

# general configurations
# ...
leaderboard: True
leaderboard_metrics:  # metrics required for the leaderboard
  - mmlu.EM
  - boolq.EM
  - quac.F1
  - hellaswag.EM
  - ...
excluded_models:   # models to exclude from the leaderboard
  - <model to exclude>
  - ...

If the leaderboard_metrics field is not provided, we use the 16 core metrics of HELM as the default leaderboard metrics; they are the same as the default benchmark metrics listed above.
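
A complete leaderboard configuration might therefore look like this (the project name, URL, and excluded model name are placeholders):

project: my-llm-evals
base_url: https://api.wandb.ai

leaderboard: True
leaderboard_metrics:
  - mmlu.EM
  - boolq.EM
  - quac.F1
  - hellaswag.EM
excluded_models:
  - my-debug-run   # hypothetical model to leave off the leaderboard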