# How It Works: The BPE Algorithm

`minbpe` provides a clear implementation of the Byte Pair Encoding (BPE) algorithm. This page breaks down the process as implemented in the `BasicTokenizer` class, which is the foundational tokenizer in this project.
## The Core Idea

The BPE algorithm starts with a small base vocabulary and learns merge rules to create new, larger tokens that represent common sequences of characters. In `minbpe`, the process is byte-level, meaning the base vocabulary consists of the 256 possible byte values.
The training process is an iterative loop that finds the most frequent pair of adjacent tokens and merges them into a single new token.
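As a quick illustration (a toy example of ours, not taken from the minbpe docs): in the UTF-8 bytes of the string `"aaab"`, the pair `(97, 97)` is the most frequent adjacent pair, so the first merge would mint a new token for it:

```python
ids = list("aaab".encode("utf-8"))  # [97, 97, 97, 98]
# adjacent pairs: (97, 97) occurs twice, (97, 98) once,
# so merging (97, 97) into a hypothetical new token 256 gives:
# [97, 97, 97, 98] -> [256, 97, 98]
```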
## Step-by-Step Training Process

The logic can be found in the `train` method of `minbpe/basic.py`. Here's a conceptual breakdown:
### 1. Initialization

- **Input:** The training process takes a large string of text and a `vocab_size` (the final number of tokens you want in your vocabulary).
- **Preprocessing:** The input text is first encoded into a sequence of bytes using UTF-8. This byte sequence is then converted into a list of integers, where each integer is between 0 and 255.

  ```python
  text_bytes = text.encode("utf-8")  # raw bytes
  ids = list(text_bytes)             # list of integers in range 0..255
  ```

- **Base Vocabulary:** The initial vocabulary is created, mapping each of the 256 possible byte values to its byte representation. For example, `vocab[65]` would be `b'A'`.
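Concretely, this initialization can be sketched with a dictionary comprehension (a minimal sketch; variable names follow the snippets above):

```python
# map every possible byte value to its one-byte bytes object
vocab = {idx: bytes([idx]) for idx in range(256)}
assert vocab[65] == b'A'  # matches the example above
```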
### 2. Iterative Merging

The core of the algorithm is a loop that runs for `num_merges = vocab_size - 256` iterations. In each iteration, the following happens:
- **Find Most Frequent Pair:** The code scans the entire list of token `ids` and counts the occurrences of all adjacent pairs. This is handled by the `get_stats(ids)` function, which returns a dictionary where keys are pairs of token IDs (e.g., `(101, 32)`) and values are their frequencies.
- **Select Best Pair:** The pair with the highest frequency is selected for merging. For example, if the pair corresponding to `('t', 'h')` (bytes 116 and 104) is the most common, it will be chosen.

  ```python
  pair = max(stats, key=stats.get)
  ```

- **Create a New Token:** A new token ID is created. The new IDs start from 256 upwards: for the first merge, the new ID is 256; for the second, it's 257, and so on.

  ```python
  idx = 256 + i
  ```

- **Perform the Merge:** The code iterates through the `ids` list and replaces every occurrence of the chosen `pair` with the new `idx`. This is handled by the `merge(ids, pair, idx)` function.
- **Store the Merge Rule:** The merge operation is recorded in a `merges` dictionary, which maps the pair to the new token ID (e.g., `merges[(116, 104)] = 256`). This is crucial for the `encode` function later.
- **Update Vocabulary:** The new token is added to the vocabulary. Its value is the concatenation of the bytes of the two tokens that formed it. For example, `vocab[256]` would become `b'th'`.
This loop continues until the desired vocabulary size is reached.
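Putting these steps together, here is a condensed sketch of the whole training procedure (simplified relative to `minbpe/basic.py`, which wraps this logic in the `BasicTokenizer.train` method, supports a `verbose` flag, and stores `merges` and `vocab` on the tokenizer instance):

```python
def get_stats(ids):
    # count occurrences of every adjacent pair of token ids
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # replace every occurrence of `pair` in `ids` with the new token `idx`
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            newids.append(idx)
            i += 2  # skip past both elements of the merged pair
        else:
            newids.append(ids[i])
            i += 1
    return newids

def train(text, vocab_size):
    # assumes the text is long enough to supply vocab_size - 256 merges
    assert vocab_size >= 256
    num_merges = vocab_size - 256
    ids = list(text.encode("utf-8"))
    merges = {}                                        # (int, int) -> int
    vocab = {idx: bytes([idx]) for idx in range(256)}  # int -> bytes
    for i in range(num_merges):
        stats = get_stats(ids)
        pair = max(stats, key=stats.get)  # most frequent adjacent pair
        idx = 256 + i                     # next unused token id
        ids = merge(ids, pair, idx)
        merges[pair] = idx
        vocab[idx] = vocab[pair[0]] + vocab[pair[1]]
    return merges, vocab
```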
### 3. Encoding and Decoding
- **Encoding:** To encode a new string, it is first converted to a list of byte IDs. Then the learned `merges` are applied in the same order they were learned: the algorithm repeatedly finds the pair in the sequence that was merged earliest during training (i.e., has the lowest merge index) and replaces it with the corresponding new token ID. This continues until no more merges can be applied.
- **Decoding:** Decoding is simpler; it is a lookup process. Each token ID in the sequence is looked up in the final `vocab` dictionary to retrieve its corresponding bytes. These bytes are then concatenated and decoded back into a UTF-8 string.
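Under the same simplified assumptions as the training sketch above (standalone functions reusing `get_stats` and `merge`, instead of `BasicTokenizer` methods), encoding and decoding can be sketched like this:

```python
def encode(text, merges):
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        # of the pairs present, pick the one merged earliest in training
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no pair in the sequence has a learned merge rule
        ids = merge(ids, pair, merges[pair])
    return ids

def decode(ids, vocab):
    # concatenate each token's bytes, then decode as UTF-8;
    # errors="replace" guards against ids that cut a multi-byte character
    text_bytes = b"".join(vocab[idx] for idx in ids)
    return text_bytes.decode("utf-8", errors="replace")
```

With `merges` and `vocab` produced by the training sketch, the round trip `decode(encode(s, merges), vocab) == s` holds for any string `s`, since every merge is invertible via the `vocab` lookup.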