Evaluation Guide
Evaluation can be performed in two ways: automatically during training, or as a standalone process using a saved checkpoint.
Evaluation During Training
While pretrain.py is running, the model is periodically evaluated on the test set. The frequency of these evaluations is controlled by the eval_interval parameter in your configuration. Metrics such as eval/exact_accuracy are logged to Weights & Biases, allowing you to track performance over time.
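If you want to inspect these metrics outside the Weights & Biases web UI, the public wandb API can retrieve them programmatically. A minimal sketch, assuming a placeholder entity/project/run path (not one from this repository):

import wandb

# Fetch the logged evaluation metric for a running or finished training run.
# The run path below is a placeholder; substitute your own entity, project, and run ID.
api = wandb.Api()
run = api.run("my-entity/MyProject/my-run-id")
history = run.history(keys=["eval/exact_accuracy"])  # returns a pandas DataFrame
print(history.tail())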
Standalone Evaluation with evaluate.py
To evaluate a trained model checkpoint without resuming training, you can use the evaluate.py script. This is useful for analyzing final performance or generating predictions for submission.
The script loads a checkpoint, runs inference on the entire test set, and saves the model's outputs (logits, puzzle identifiers, etc.) to disk.
Command:
OMP_NUM_THREADS=8 torchrun --nproc-per-node <NUM_GPUS> evaluate.py checkpoint=<PATH_TO_CHECKPOINT>
<PATH_TO_CHECKPOINT>: The path to the saved model file, e.g., checkpoints/MyProject/MyRun/step_100000.
This will generate prediction files named step_<STEP>_all_preds.<RANK> in the checkpoint directory, where <RANK> is the rank of the distributed process (one file per GPU).
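Downstream analysis usually needs these per-rank files merged back into a single set of outputs. Below is a minimal sketch, assuming each file is a torch.save'd dictionary of tensors keyed by output name (e.g. logits, puzzle identifiers); adjust the keys, directory, and step number to match your run:

import glob
import torch

ckpt_dir = "checkpoints/MyProject/MyRun"  # directory containing the checkpoint
files = sorted(glob.glob(f"{ckpt_dir}/step_100000_all_preds.*"))  # one file per rank

merged = {}
for path in files:
    preds = torch.load(path, map_location="cpu")
    for key, value in preds.items():
        merged.setdefault(key, []).append(value)

# Concatenate along the batch dimension so each output becomes a single tensor.
merged = {key: torch.cat(values, dim=0) for key, values in merged.items()}
print({key: tuple(value.shape) for key, value in merged.items()})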
ARC-Specific Evaluation Workflow
Evaluating performance on the ARC benchmark is a two-step process, because the model is evaluated on multiple augmented variants of each test puzzle and their predictions must then be aggregated.
- Generate Predictions: First, run the evaluate.py script as described above. This generates raw predictions for every augmented version of each test puzzle.

  # Example for an ARC-1 checkpoint
  OMP_NUM_THREADS=8 torchrun --nproc-per-node 8 evaluate.py checkpoint=checkpoints/Arc-aug-1000.../step_414456

- Aggregate and Score: Next, use the arc_eval.ipynb Jupyter notebook to process these raw predictions (a sketch of the de-augmentation and voting logic appears after this list). The notebook performs several critical functions:
  - Loads all prediction files (..._all_preds.*).
  - Applies the inverse dihedral and color permutations to revert the augmentations.
  - For each original test input, aggregates the predictions from all its augmented variants.
  - Performs voting or ranking based on prediction frequency and model confidence (Q-values).
  - Calculates the final Top-K accuracy metrics and allows for visual inspection of the model's answers.
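For intuition, here is a minimal sketch of the de-augmentation and majority-voting steps. The transform encoding (an id in [0, 8) selecting one of the eight dihedral symmetries) and all function names are illustrative assumptions, not the notebook's actual API; the real notebook additionally inverts color permutations and ranks candidates by Q-value:

import numpy as np
from collections import Counter

def dihedral_transform(grid: np.ndarray, tid: int) -> np.ndarray:
    # Apply one of the 8 dihedral symmetries: optional horizontal flip, then a 90-degree rotation.
    if tid >= 4:
        grid = np.fliplr(grid)
    return np.rot90(grid, k=tid % 4)

def inverse_dihedral_transform(grid: np.ndarray, tid: int) -> np.ndarray:
    # Undo dihedral_transform(grid, tid): rotate back first, then undo the flip.
    grid = np.rot90(grid, k=-(tid % 4))
    if tid >= 4:
        grid = np.fliplr(grid)
    return grid

def majority_vote(predictions: list) -> np.ndarray:
    # Return the most frequent de-augmented prediction grid.
    counts = Counter()
    keyed = {}
    for pred in predictions:
        key = (pred.shape, pred.tobytes())
        counts[key] += 1
        keyed[key] = pred
    best_key, _ = counts.most_common(1)[0]
    return keyed[best_key]

# Toy example: three predictions for the same puzzle, each made on a differently
# augmented input and tagged with the transform id used to augment it.
answer = np.eye(3, dtype=int)
augmented_preds = [(dihedral_transform(answer, tid), tid) for tid in (0, 1, 4)]
restored = [inverse_dihedral_transform(grid, tid) for grid, tid in augmented_preds]
print(majority_vote(restored))  # all three variants agree on the original grid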
To use it, open arc_eval.ipynb in a Jupyter environment, set the DATASET_PATH and CHECKPOINT_PATH variables, and run the cells.
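As a purely illustrative example (the paths below are placeholders, not values shipped with the repository), the two variables might be set like this before running the remaining cells:

# Placeholder paths; point these at your own ARC dataset build and checkpoint directory.
DATASET_PATH = "data/arc-aug-1000"
CHECKPOINT_PATH = "checkpoints/MyProject/MyRun/step_100000"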
Pre-trained Checkpoints
We provide several pre-trained model checkpoints on Hugging Face. Download a checkpoint and use the evaluation scripts above to reproduce the reported results.