# Evaluation Guide
Evaluation can be performed in two ways: automatically during training, or as a standalone process using a saved checkpoint.
## Evaluation During Training
While `pretrain.py` is running, the model is periodically evaluated on the test set. The frequency of these evaluations is controlled by the `eval_interval` parameter in your configuration. Metrics such as `eval/exact_accuracy` are logged to Weights & Biases, allowing you to track performance over time.
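If you launch training with the same `key=value` override style used by the evaluation command below, the interval can also be set directly on the command line. A hypothetical example (the interval value here is arbitrary):

```bash
# Hypothetical override: run the test-set evaluation every 2000 training steps
OMP_NUM_THREADS=8 torchrun --nproc-per-node 8 pretrain.py eval_interval=2000
```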
## Standalone Evaluation with `evaluate.py`
To evaluate a trained model checkpoint without resuming training, you can use the `evaluate.py` script. This is useful for analyzing final performance or generating predictions for submission. The script loads a checkpoint, runs inference on the entire test set, and saves the model's outputs (logits, puzzle identifiers, etc.) to disk.
Command:

```bash
OMP_NUM_THREADS=8 torchrun --nproc-per-node <NUM_GPUS> evaluate.py checkpoint=<PATH_TO_CHECKPOINT>
```
- `<PATH_TO_CHECKPOINT>`: the path to the saved model file, e.g., `checkpoints/MyProject/MyRun/step_100000`.
This will generate prediction files named `step_<STEP>_all_preds.<RANK>` in the checkpoint directory, where `<RANK>` is the GPU process ID.
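If you want to inspect these files outside of the notebook described below, they can be loaded with `torch.load`. The sketch below is illustrative only; the exact keys stored in each file (e.g. `logits`, `puzzle_identifiers`) are assumptions, so check them with `.keys()` first.

```python
# Illustrative sketch: load and inspect the per-rank prediction files.
# The key names are assumptions -- verify with list(preds.keys()).
import glob

import torch

for path in sorted(glob.glob("checkpoints/MyProject/MyRun/step_100000_all_preds.*")):
    preds = torch.load(path, map_location="cpu")
    print(path, list(preds.keys()))  # e.g. logits, puzzle_identifiers, ...
```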
## ARC-Specific Evaluation Workflow
Evaluating performance on the ARC benchmark is a two-step process due to the use of data augmentation during training.
1. **Generate Predictions:** First, run the `evaluate.py` script as described above. This generates raw predictions for every augmented version of each test puzzle.

   ```bash
   # Example for an ARC-1 checkpoint
   OMP_NUM_THREADS=8 torchrun --nproc-per-node 8 evaluate.py checkpoint=checkpoints/Arc-aug-1000.../step_414456
   ```
2. **Aggregate and Score:** Next, use the `arc_eval.ipynb` Jupyter notebook to process these raw predictions. The notebook performs several critical functions:
   - Loads all prediction files (`..._all_preds.*`).
   - Applies the inverse dihedral and color permutations to revert the augmentations.
   - Aggregates, for each original test input, the predictions from all of its augmented variants.
   - Performs voting or ranking based on prediction frequency and model confidence (Q-values); a minimal sketch of this logic appears below.
   - Calculates the final Top-K accuracy metrics and allows for visual inspection of the model's answers.
To use it, open `arc_eval.ipynb` in a Jupyter environment, set the `DATASET_PATH` and `CHECKPOINT_PATH` variables, and run the cells.
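For intuition, here is a minimal, self-contained sketch of the de-augmentation and voting logic described above. It is not the notebook's implementation: the dihedral-transform encoding, the data layout, and all function names are assumptions made for illustration.

```python
# Minimal sketch of ARC prediction aggregation: undo the augmentations applied
# to each test puzzle, then vote among the restored grids. The transform
# encoding below is an assumption and must match how the data was augmented.
from collections import defaultdict

import numpy as np


def invert_dihedral(grid: np.ndarray, tid: int) -> np.ndarray:
    """Undo one of the 8 dihedral transforms, assuming the forward transform
    rotated CCW by (tid & 3) quarter turns, then flipped left-right if bit 2
    of tid is set."""
    if tid & 4:
        grid = np.fliplr(grid)           # the flip was applied last, so undo it first
    return np.rot90(grid, k=-(tid & 3))  # then undo the CCW rotation


def invert_colors(grid: np.ndarray, perm: np.ndarray) -> np.ndarray:
    """Undo a color permutation that mapped color c to perm[c]."""
    inv = np.empty_like(perm)
    inv[perm] = np.arange(len(perm))
    return inv[grid]


def vote(variants, k=2):
    """variants: a list of (grid, tid, perm, q_value) tuples, one per augmented
    copy of a single original test input. Returns the top-k restored grids,
    ranked by frequency first and mean Q-value second."""
    buckets = defaultdict(list)
    for grid, tid, perm, q in variants:
        restored = invert_colors(invert_dihedral(grid, tid), perm)
        buckets[(restored.shape, restored.tobytes())].append((restored, q))
    ranked = sorted(
        buckets.values(),
        key=lambda group: (len(group), np.mean([q for _, q in group])),
        reverse=True,
    )
    return [group[0][0] for group in ranked[:k]]
```

The default `k=2` mirrors ARC's two-attempts-per-task scoring; ranking by frequency first means the Q-values only break ties between equally common answers.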
## Pre-trained Checkpoints
We provide several pre-trained model checkpoints on Hugging Face. Download a checkpoint and use the evaluation scripts above to reproduce the results.
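If you prefer to fetch checkpoints programmatically rather than through the website, the `huggingface_hub` client can download a whole repository. The repository id below is a placeholder, not an actual checkpoint name:

```python
# Sketch: download a checkpoint repository from Hugging Face.
# "<ORG>/<CHECKPOINT_REPO>" is a placeholder -- substitute the real repo id.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="<ORG>/<CHECKPOINT_REPO>")
print(local_dir)  # point evaluate.py at the checkpoint file inside this directory
```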