Data Processing and Pipeline
Effective data preparation is crucial for HRM's performance. The data pipeline transforms raw datasets from various sources into a unified, efficient format that the model can consume. This process involves parsing, augmentation, and serialization into NumPy arrays.
1. Raw Data Sources
The project sources data from several locations, which are managed as Git submodules in the `dataset/raw-data/` directory. This includes:
- ARC-AGI and ConceptARC: For the Abstraction and Reasoning Corpus tasks.
- Sudoku Puzzles: Sourced from Hugging Face Hub as CSV files.
- Mazes: Also sourced from Hugging Face Hub.
2. Build Scripts
A collection of scripts in the `dataset/` directory is responsible for processing this raw data:
- `dataset/build_arc_dataset.py`
- `dataset/build_sudoku_dataset.py`
- `dataset/build_maze_dataset.py`
These scripts perform several key steps:
- Parsing: They read the source files (JSON for ARC, CSV for Sudoku/Mazes) and extract the input/output pairs.
- Augmentation: To improve generalization and expand the training set, they apply domain-specific data augmentations.
  - ARC: Dihedral transformations (rotations, flips) and color permutations are applied to the grids. See `dataset/common.py` (a sketch follows this list).
  - Sudoku: Valid transformations like digit remapping and permutation of rows/columns within bands/stacks are applied. See `shuffle_sudoku` in `dataset/build_sudoku_dataset.py` (also sketched after this list).
- Serialization: The processed data, including all augmentations, is converted into a set of NumPy (`.npy`) files.
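
To make the ARC augmentation concrete, here is a minimal sketch assuming grids are small 2D integer arrays of colors 0-9. The function names, and the choice to keep color 0 (background) fixed during the permutation, are illustrative assumptions rather than the exact API of `dataset/common.py`.

```python
import numpy as np

def dihedral_transform(grid: np.ndarray, variant: int) -> np.ndarray:
    """Apply one of the 8 dihedral symmetries: 4 rotations, each optionally mirrored."""
    out = np.rot90(grid, k=variant % 4)
    return np.fliplr(out) if variant >= 4 else out

def augment_arc_pair(inp: np.ndarray, out: np.ndarray, rng: np.random.Generator):
    """Apply the same dihedral transform and color permutation to an input/output pair."""
    variant = int(rng.integers(8))
    # Remap colors 1-9 at random; keeping 0 (background) fixed is an assumption here.
    mapping = np.arange(10)
    mapping[1:] = rng.permutation(np.arange(1, 10))
    return mapping[dihedral_transform(inp, variant)], mapping[dihedral_transform(out, variant)]

# Example: four augmented copies of a single training pair.
rng = np.random.default_rng(0)
inp = np.array([[0, 1], [2, 3]])
out = np.array([[3, 2], [1, 0]])
augmented = [augment_arc_pair(inp, out, rng) for _ in range(4)]
```

The Sudoku case can be sketched in the same spirit. `shuffle_sudoku_sketch` below is a hypothetical stand-in for `shuffle_sudoku` in `dataset/build_sudoku_dataset.py`: it remaps digits and permutes rows and columns only within (and across) the 3-row bands and 3-column stacks, which preserves Sudoku validity, and applies the identical transformation to the puzzle and its solution so the pair stays consistent.

```python
import numpy as np

def shuffle_sudoku_sketch(puzzle: np.ndarray, solution: np.ndarray,
                          rng: np.random.Generator):
    """Apply the same validity-preserving transforms to a 9x9 puzzle/solution pair.

    Blanks in the puzzle are assumed to be encoded as 0.
    """
    # Digit remapping: permute digits 1-9, leaving blanks (0) untouched.
    mapping = np.zeros(10, dtype=puzzle.dtype)
    mapping[1:] = rng.permutation(np.arange(1, 10, dtype=puzzle.dtype))

    # Permute the 3 rows inside each horizontal band, and the bands themselves;
    # same idea for columns within vertical stacks.
    row_order = np.concatenate([band * 3 + rng.permutation(3) for band in rng.permutation(3)])
    col_order = np.concatenate([stack * 3 + rng.permutation(3) for stack in rng.permutation(3)])

    def transform(board: np.ndarray) -> np.ndarray:
        return mapping[board][row_order][:, col_order]

    return transform(puzzle), transform(solution)
```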
3. The Processed Dataset Format
After a build script is run, it creates an output directory (e.g., `data/arc-aug-1000/`) containing subdirectories for the `train` and `test` splits. Each split directory contains the following files:
- `inputs.npy`: A 2D array of shape `[num_examples, seq_len]` containing the flattened input grids.
- `labels.npy`: A 2D array of shape `[num_examples, seq_len]` containing the flattened output grids.
- `puzzle_identifiers.npy`: An array mapping each puzzle to a unique integer ID.
- `puzzle_indices.npy`: An index array that maps a puzzle ID to the range of its corresponding examples in `inputs.npy` and `labels.npy`.
- `group_indices.npy`: An index array that groups different augmentations of the same original puzzle together.
- `dataset.json`: A metadata file containing information like sequence length, vocabulary size, and other dataset statistics.
- `identifiers.json`: A global mapping from integer puzzle IDs back to their original string names.
This structure is optimized for fast, memory-mapped loading during training.
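
The index arrays are easiest to understand with a short example. The sketch below assumes `puzzle_indices` and `group_indices` store boundary offsets (entries `i` and `i + 1` delimit a range); the path and the printed fields are illustrative, not guaranteed by the build scripts.

```python
# A minimal sketch, assuming the index arrays store boundary offsets.
import json
import numpy as np

split_dir = "data/arc-aug-1000/train"  # hypothetical output of build_arc_dataset.py

inputs = np.load(f"{split_dir}/inputs.npy", mmap_mode="r")
labels = np.load(f"{split_dir}/labels.npy", mmap_mode="r")
puzzle_indices = np.load(f"{split_dir}/puzzle_indices.npy")
group_indices = np.load(f"{split_dir}/group_indices.npy")

with open(f"{split_dir}/dataset.json") as f:
    metadata = json.load(f)
print("metadata keys:", sorted(metadata))

# group_indices[g] .. group_indices[g + 1] -> which puzzles (augmentations)
# belong to group g; puzzle_indices[p] .. puzzle_indices[p + 1] -> which rows
# of inputs/labels belong to puzzle p.
g = 0
num_puzzles_in_group = group_indices[g + 1] - group_indices[g]
first_puzzle = int(group_indices[g])
start, end = puzzle_indices[first_puzzle], puzzle_indices[first_puzzle + 1]

print(f"group {g}: {num_puzzles_in_group} augmented puzzles")
print("examples of its first puzzle:", inputs[start:end].shape, labels[start:end].shape)
```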
4. Data Loading in Training
The puzzle_dataset.py
script defines the PuzzleDataset
class, which is a PyTorch IterableDataset
responsible for loading the processed .npy
files.
- Memory Efficiency: It uses `mmap_mode='r'` to load the large `inputs` and `labels` arrays, keeping memory usage low.
- Training Mode: During training (`test_set_mode=False`), it samples by "groups." This ensures that in each epoch the model sees a random augmentation of each puzzle, promoting better generalization (see the sketch after this list).
- Evaluation Mode: During evaluation (`test_set_mode=True`), it iterates through all examples sequentially to ensure a complete and consistent evaluation.
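
The sketch below is a simplified stand-in for `PuzzleDataset`, not the repository's actual implementation (which also handles batching, padding, and distributed workers). It only illustrates the access patterns just described: memory-mapped loading, one randomly chosen augmentation per group per epoch for training, and a sequential sweep for evaluation. Class and path names are assumptions.

```python
import numpy as np
from torch.utils.data import IterableDataset

class TinyPuzzleDataset(IterableDataset):
    """Minimal sketch of group-based sampling over the processed .npy files."""

    def __init__(self, split_dir: str, train: bool, seed: int = 0):
        # Memory-map the large arrays so only accessed rows are read from disk.
        self.inputs = np.load(f"{split_dir}/inputs.npy", mmap_mode="r")
        self.labels = np.load(f"{split_dir}/labels.npy", mmap_mode="r")
        self.puzzle_indices = np.load(f"{split_dir}/puzzle_indices.npy")
        self.group_indices = np.load(f"{split_dir}/group_indices.npy")
        self.train = train
        self.rng = np.random.default_rng(seed)

    def __iter__(self):
        if self.train:
            # Training: visit groups in random order; draw one random example
            # from a randomly chosen puzzle (augmentation) in each group.
            for g in self.rng.permutation(len(self.group_indices) - 1):
                puzzle = self.rng.integers(self.group_indices[g], self.group_indices[g + 1])
                example = self.rng.integers(self.puzzle_indices[puzzle],
                                            self.puzzle_indices[puzzle + 1])
                yield np.asarray(self.inputs[example]), np.asarray(self.labels[example])
        else:
            # Evaluation: iterate over every example exactly once, in order.
            for example in range(len(self.inputs)):
                yield np.asarray(self.inputs[example]), np.asarray(self.labels[example])
```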
5. Visualizing the Dataset
To help inspect and debug the processed data, the repository includes `puzzle_visualizer.html`. This is a self-contained HTML file that can load a processed dataset directory locally in your browser. It allows you to navigate through groups and puzzles and see the input/output grids for each example, which is useful for verifying that augmentations and processing steps were applied correctly.