Evaluation Guide

Evaluation can be performed in two ways: automatically during training, or as a standalone process using a saved checkpoint.

Evaluation During Training

While pretrain.py is running, the model is periodically evaluated on the test set. The frequency of these evaluations is controlled by the eval_interval parameter in your configuration. Metrics such as eval/exact_accuracy are logged to Weights & Biases, allowing you to track performance over time.
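
As a rough illustration only (the metric actually computed by pretrain.py may differ in details such as masking or padding handling), eval/exact_accuracy can be thought of as the fraction of test sequences whose predicted tokens all match the labels:

    # Illustrative sketch of an "exact accuracy" metric: a prediction counts as
    # correct only if the entire output sequence matches the label.
    # This is an assumption about the metric's meaning, not the project's code.
    import torch

    def exact_accuracy(pred_tokens: torch.Tensor, label_tokens: torch.Tensor) -> float:
        """pred_tokens, label_tokens: (batch, seq_len) integer tensors."""
        per_sequence_correct = (pred_tokens == label_tokens).all(dim=-1)
        return per_sequence_correct.float().mean().item()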

Standalone Evaluation with evaluate.py

To evaluate a trained model checkpoint without resuming training, you can use the evaluate.py script. This is useful for analyzing final performance or generating predictions for submission.

The script loads a checkpoint, runs inference on the entire test set, and saves the model's outputs (logits, puzzle identifiers, etc.) to disk.

Command:

OMP_NUM_THREADS=8 torchrun --nproc-per-node <NUM_GPUS> evaluate.py checkpoint=<PATH_TO_CHECKPOINT>
  • <NUM_GPUS>: The number of GPU processes to launch, typically one per available GPU.
  • <PATH_TO_CHECKPOINT>: The path to the saved model file, e.g., checkpoints/MyProject/MyRun/step_100000.

This will generate prediction files named step_<STEP>_all_preds.<RANK> in the checkpoint directory, where <RANK> is the rank of the GPU process that produced the file.
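
The serialization format of these files is determined by evaluate.py; assuming they are standard PyTorch-serialized dictionaries (an assumption, not something this guide specifies), they can be inspected with a short script such as the one below. The directory and step number are placeholders.

    # Hedged sketch: inspect the per-rank prediction files written by evaluate.py.
    # Assumes each file was written with torch.save() and contains a dict;
    # adjust if the actual on-disk format differs.
    import glob
    import torch

    checkpoint_dir = "checkpoints/MyProject/MyRun"  # placeholder path
    step = 100000                                   # placeholder step number

    for path in sorted(glob.glob(f"{checkpoint_dir}/step_{step}_all_preds.*")):
        preds = torch.load(path, map_location="cpu")
        print(path)
        for key, value in preds.items():
            shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
            print(f"  {key}: {shape}")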

ARC-Specific Evaluation Workflow

Evaluating performance on the ARC benchmark is a two-step process: because the dataset is built with many augmented variants of each puzzle, the raw predictions must first be de-augmented and aggregated before they can be scored.

  1. Generate Predictions: First, run the evaluate.py script as described above. This generates raw predictions for every augmented version of each test puzzle.

    # Example for an ARC-1 checkpoint
    OMP_NUM_THREADS=8 torchrun --nproc-per-node 8 evaluate.py checkpoint=checkpoints/Arc-aug-1000.../step_414456
  2. Aggregate and Score: Next, use the arc_eval.ipynb Jupyter notebook to process these raw predictions. The notebook performs several critical functions:

    • Loads all prediction files (..._all_preds.*).
    • Applies the inverse dihedral and color permutations to revert the augmentations.
    • Aggregates, for each original test input, the predictions from all of its augmented variants.
    • Performs voting or ranking based on prediction frequency and model confidence (Q-values); a rough sketch of this step appears below.
    • Calculates the final Top-K accuracy metrics and allows for visual inspection of the model's answers.

To use it, open arc_eval.ipynb in a Jupyter environment, set the DATASET_PATH and CHECKPOINT_PATH variables, and run the cells.
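
The authoritative aggregation logic lives in the notebook itself; the following is only a rough sketch of the frequency-plus-confidence voting idea, using hypothetical field names (puzzle_id, grid, q_value) rather than the notebook's actual data structures:

    # Rough sketch of voting over de-augmented predictions: candidate answers are
    # ranked by how many augmented variants produced them, with mean Q-value as a
    # tie-breaker. Field names are hypothetical.
    from collections import defaultdict

    def top_k_answers(predictions, k=2):
        """predictions: one dict per de-augmented prediction, e.g.
        {"puzzle_id": "abc123", "grid": ((1, 0), (0, 1)), "q_value": 0.93}."""
        by_puzzle = defaultdict(lambda: defaultdict(list))
        for p in predictions:
            # Group identical de-augmented grids belonging to the same original puzzle.
            by_puzzle[p["puzzle_id"]][p["grid"]].append(p["q_value"])

        answers = {}
        for puzzle_id, grids in by_puzzle.items():
            # Rank candidates by vote count, breaking ties with mean Q-value.
            ranked = sorted(
                grids.items(),
                key=lambda item: (len(item[1]), sum(item[1]) / len(item[1])),
                reverse=True,
            )
            answers[puzzle_id] = [grid for grid, _ in ranked[:k]]
        return answers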

Pre-trained Checkpoints

We provide several pre-trained model checkpoints on Hugging Face. Download a checkpoint and use the evaluation workflow above to reproduce the reported results.
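
If a checkpoint is published as a Hugging Face model repository, it can be fetched with huggingface_hub; the repository ID below is a placeholder, since the specific checkpoint repositories are not listed in this section.

    # Hedged sketch: download a checkpoint repository with huggingface_hub.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(
        repo_id="<ORG>/<CHECKPOINT_REPO>",  # placeholder, substitute the real repo ID
        local_dir="checkpoints/downloaded",
    )
    print(f"Checkpoint files downloaded to: {local_dir}")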