Data Processing and Pipeline
Effective data preparation is crucial for HRM's performance. The data pipeline transforms raw datasets from various sources into a unified, efficient format that the model can consume. This process involves parsing, augmentation, and serialization into NumPy arrays.
1. Raw Data Sources
The project sources data from several locations, which are managed as Git submodules in the `dataset/raw-data/` directory. This includes:
- ARC-AGI and ConceptARC: For the Abstraction and Reasoning Corpus tasks.
- Sudoku Puzzles: Sourced from Hugging Face Hub as CSV files.
- Mazes: Also sourced from Hugging Face Hub.
2. Build Scripts
A collection of scripts in the `dataset/` directory is responsible for processing this raw data:
- `dataset/build_arc_dataset.py`
- `dataset/build_sudoku_dataset.py`
- `dataset/build_maze_dataset.py`
These scripts perform several key steps:
- Parsing: They read the source files (JSON for ARC, CSV for Sudoku/Mazes) and extract the input/output pairs.
- Augmentation: To improve generalization and expand the training set, they apply domain-specific data augmentations.
  - ARC: Dihedral transformations (rotations, flips) and color permutations are applied to the grids. See `dataset/common.py` (a sketch follows this list).
  - Sudoku: Valid transformations like digit remapping and permutation of rows/columns within bands/stacks are applied. See `shuffle_sudoku` in `dataset/build_sudoku_dataset.py` (also sketched after this list).
- Serialization: The processed data, including all augmentations, is converted into a set of NumPy (`.npy`) files.
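
To make the ARC augmentation concrete, here is a minimal sketch assuming grids are small 2D integer arrays of colors 0-9. The function names, and the choice to keep color 0 (background) fixed during the permutation, are illustrative assumptions rather than the exact API of `dataset/common.py`.

```python
import numpy as np

def dihedral_transform(grid: np.ndarray, variant: int) -> np.ndarray:
    """Apply one of the 8 dihedral symmetries: 4 rotations, each optionally mirrored."""
    out = np.rot90(grid, k=variant % 4)
    return np.fliplr(out) if variant >= 4 else out

def augment_arc_pair(inp: np.ndarray, out: np.ndarray, rng: np.random.Generator):
    """Apply the same dihedral transform and color permutation to an input/output pair."""
    variant = int(rng.integers(8))
    # Remap colors 1-9 at random; keeping 0 (background) fixed is an assumption here.
    mapping = np.arange(10)
    mapping[1:] = rng.permutation(np.arange(1, 10))
    return mapping[dihedral_transform(inp, variant)], mapping[dihedral_transform(out, variant)]

# Example: four augmented copies of a single training pair.
rng = np.random.default_rng(0)
inp = np.array([[0, 1], [2, 3]])
out = np.array([[3, 2], [1, 0]])
augmented = [augment_arc_pair(inp, out, rng) for _ in range(4)]
```

The Sudoku case can be sketched in the same spirit. `shuffle_sudoku_sketch` below is a hypothetical stand-in for `shuffle_sudoku` in `dataset/build_sudoku_dataset.py`: it remaps digits and permutes rows and columns only within (and across) the 3-row bands and 3-column stacks, which preserves Sudoku validity, and applies the identical transformation to the puzzle and its solution so the pair stays consistent.

```python
import numpy as np

def shuffle_sudoku_sketch(puzzle: np.ndarray, solution: np.ndarray,
                          rng: np.random.Generator):
    """Apply the same validity-preserving transforms to a 9x9 puzzle/solution pair.

    Blanks in the puzzle are assumed to be encoded as 0.
    """
    # Digit remapping: permute digits 1-9, leaving blanks (0) untouched.
    mapping = np.zeros(10, dtype=puzzle.dtype)
    mapping[1:] = rng.permutation(np.arange(1, 10, dtype=puzzle.dtype))

    # Permute the 3 rows inside each horizontal band, and the bands themselves;
    # same idea for columns within vertical stacks.
    row_order = np.concatenate([band * 3 + rng.permutation(3) for band in rng.permutation(3)])
    col_order = np.concatenate([stack * 3 + rng.permutation(3) for stack in rng.permutation(3)])

    def transform(board: np.ndarray) -> np.ndarray:
        return mapping[board][row_order][:, col_order]

    return transform(puzzle), transform(solution)
```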
3. The Processed Dataset Format
After a build script is run, it creates an output directory (e.g., `data/arc-aug-1000/`) containing subdirectories for the `train` and `test` splits. Each split directory contains the following files:
- `inputs.npy`: A 2D array of shape `[num_examples, seq_len]` containing the flattened input grids.
- `labels.npy`: A 2D array of shape `[num_examples, seq_len]` containing the flattened output grids.
- `puzzle_identifiers.npy`: An array mapping each puzzle to a unique integer ID.
- `puzzle_indices.npy`: An index array that maps a puzzle ID to the range of its corresponding examples in `inputs.npy` and `labels.npy`.
- `group_indices.npy`: An index array that groups different augmentations of the same original puzzle together.
- `dataset.json`: A metadata file containing information like sequence length, vocabulary size, and other dataset statistics.
- `identifiers.json`: A global mapping from integer puzzle IDs back to their original string names.
This structure is optimized for fast, memory-mapped loading during training.
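
The index arrays are easiest to understand with a short example. The sketch below assumes `puzzle_indices` and `group_indices` store boundary offsets (entries `i` and `i + 1` delimit a range); the path and the printed fields are illustrative, not guaranteed by the build scripts.

```python
# A minimal sketch, assuming the index arrays store boundary offsets.
import json
import numpy as np

split_dir = "data/arc-aug-1000/train"  # hypothetical output of build_arc_dataset.py

inputs = np.load(f"{split_dir}/inputs.npy", mmap_mode="r")
labels = np.load(f"{split_dir}/labels.npy", mmap_mode="r")
puzzle_indices = np.load(f"{split_dir}/puzzle_indices.npy")
group_indices = np.load(f"{split_dir}/group_indices.npy")

with open(f"{split_dir}/dataset.json") as f:
    metadata = json.load(f)
print("metadata keys:", sorted(metadata))

# group_indices[g] .. group_indices[g + 1] -> which puzzles (augmentations)
# belong to group g; puzzle_indices[p] .. puzzle_indices[p + 1] -> which rows
# of inputs/labels belong to puzzle p.
g = 0
num_puzzles_in_group = group_indices[g + 1] - group_indices[g]
first_puzzle = int(group_indices[g])
start, end = puzzle_indices[first_puzzle], puzzle_indices[first_puzzle + 1]

print(f"group {g}: {num_puzzles_in_group} augmented puzzles")
print("examples of its first puzzle:", inputs[start:end].shape, labels[start:end].shape)
```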
4. Data Loading in Training
The puzzle_dataset.py
script defines the PuzzleDataset
class, which is a PyTorch IterableDataset
responsible for loading the processed .npy
files.
- Memory Efficiency: It uses `mmap_mode='r'` to load the large `inputs` and `labels` arrays, keeping memory usage low.
- Training Mode: During training (`test_set_mode=False`), it samples by "groups." This ensures that in each epoch the model sees a random augmentation of each puzzle, promoting better generalization (see the sketch after this list).
- Evaluation Mode: During evaluation (`test_set_mode=True`), it iterates through all examples sequentially to ensure a complete and consistent evaluation.
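
The sketch below is a simplified stand-in for `PuzzleDataset`, not the repository's actual implementation (which also handles batching, padding, and distributed workers). It only illustrates the access patterns just described: memory-mapped loading, one randomly chosen augmentation per group per epoch for training, and a sequential sweep for evaluation. Class and path names are assumptions.

```python
import numpy as np
from torch.utils.data import IterableDataset

class TinyPuzzleDataset(IterableDataset):
    """Minimal sketch of group-based sampling over the processed .npy files."""

    def __init__(self, split_dir: str, train: bool, seed: int = 0):
        # Memory-map the large arrays so only accessed rows are read from disk.
        self.inputs = np.load(f"{split_dir}/inputs.npy", mmap_mode="r")
        self.labels = np.load(f"{split_dir}/labels.npy", mmap_mode="r")
        self.puzzle_indices = np.load(f"{split_dir}/puzzle_indices.npy")
        self.group_indices = np.load(f"{split_dir}/group_indices.npy")
        self.train = train
        self.rng = np.random.default_rng(seed)

    def __iter__(self):
        if self.train:
            # Training: visit groups in random order; draw one random example
            # from a randomly chosen puzzle (augmentation) in each group.
            for g in self.rng.permutation(len(self.group_indices) - 1):
                puzzle = self.rng.integers(self.group_indices[g], self.group_indices[g + 1])
                example = self.rng.integers(self.puzzle_indices[puzzle],
                                            self.puzzle_indices[puzzle + 1])
                yield np.asarray(self.inputs[example]), np.asarray(self.labels[example])
        else:
            # Evaluation: iterate over every example exactly once, in order.
            for example in range(len(self.inputs)):
                yield np.asarray(self.inputs[example]), np.asarray(self.labels[example])
```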
5. Visualizing the Dataset
To help inspect and debug the processed data, the repository includes `puzzle_visualizer.html`. This is a self-contained HTML file that can load a processed dataset directory locally in your browser. It allows you to navigate through groups and puzzles and see the input/output grids for each example, which is useful for verifying that augmentations and processing steps were applied correctly.