# How It Works: The BPE Algorithm

`minbpe` provides a clear implementation of the Byte Pair Encoding (BPE) algorithm. This page breaks down the process as implemented in the `BasicTokenizer` class, which is the foundational tokenizer in this project.
## The Core Idea

The BPE algorithm starts with a small base vocabulary and learns merge rules to create new, larger tokens that represent common sequences of characters. In `minbpe`, the process is byte-level, meaning the base vocabulary consists of the 256 possible byte values.
The training process is an iterative loop that finds the most frequent pair of adjacent tokens and merges them into a single new token.
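As a quick illustration (a toy example of ours, not taken from the minbpe docs): in the UTF-8 bytes of the string `"aaab"`, the pair `(97, 97)` is the most frequent adjacent pair, so the first merge would mint a new token for it:

```python
ids = list("aaab".encode("utf-8"))  # [97, 97, 97, 98]
# adjacent pairs: (97, 97) occurs twice, (97, 98) once,
# so merging (97, 97) into a hypothetical new token 256 gives:
# [97, 97, 97, 98] -> [256, 97, 98]
```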
## Step-by-Step Training Process

The logic can be found in the `train` method of `minbpe/basic.py`. Here's a conceptual breakdown:
### 1. Initialization

- **Input:** The training process takes a large string of text and a `vocab_size` (the final number of tokens you want in your vocabulary).
- **Preprocessing:** The input text is first encoded into a sequence of bytes using UTF-8. This byte sequence is then converted into a list of integers, where each integer is between 0 and 255.

  ```python
  text_bytes = text.encode("utf-8")  # raw bytes
  ids = list(text_bytes)             # list of integers in range 0..255
  ```

- **Base Vocabulary:** The initial vocabulary is created, mapping each of the 256 possible byte values to its byte representation. For example, `vocab[65]` would be `b'A'`.
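Concretely, this initialization can be sketched with a dictionary comprehension (a minimal sketch; variable names follow the snippets above):

```python
# map every possible byte value to its one-byte bytes object
vocab = {idx: bytes([idx]) for idx in range(256)}
assert vocab[65] == b'A'  # matches the example above
```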
### 2. Iterative Merging

The core of the algorithm is a loop that runs for `num_merges = vocab_size - 256` iterations. In each iteration, the following happens:
- **Find Most Frequent Pair:** The code scans the entire list of token `ids` and counts the occurrences of all adjacent pairs. This is handled by the `get_stats(ids)` function, which returns a dictionary where keys are pairs of token IDs (e.g., `(101, 32)`) and values are their frequencies.
- **Select Best Pair:** The pair with the highest frequency is selected for merging. For example, if the pair corresponding to `('t', 'h')` (bytes 116 and 104) is the most common, it will be chosen.

  ```python
  pair = max(stats, key=stats.get)
  ```

- **Create a New Token:** A new token ID is created. The new IDs start from 256 upwards: for the first merge, the new ID is 256; for the second, it's 257, and so on.

  ```python
  idx = 256 + i
  ```

- **Perform the Merge:** The code iterates through the `ids` list and replaces every occurrence of the chosen `pair` with the new `idx`. This is handled by the `merge(ids, pair, idx)` function.
- **Store the Merge Rule:** The merge operation is recorded in a `merges` dictionary, which maps the pair to the new token ID (e.g., `merges[(116, 104)] = 256`). This is crucial for the `encode` function later.
- **Update Vocabulary:** The new token is added to the vocabulary. Its value is the concatenation of the bytes of the two tokens that formed it. For example, `vocab[256]` would become `b'th'`.
This loop continues until the desired vocabulary size is reached.
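Putting these steps together, here is a condensed sketch of the whole training procedure (simplified relative to `minbpe/basic.py`, which wraps this logic in the `BasicTokenizer.train` method, supports a `verbose` flag, and stores `merges` and `vocab` on the tokenizer instance):

```python
def get_stats(ids):
    # count occurrences of every adjacent pair of token ids
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # replace every occurrence of `pair` in `ids` with the new token `idx`
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            newids.append(idx)
            i += 2  # skip past both elements of the merged pair
        else:
            newids.append(ids[i])
            i += 1
    return newids

def train(text, vocab_size):
    # assumes the text is long enough to supply vocab_size - 256 merges
    assert vocab_size >= 256
    num_merges = vocab_size - 256
    ids = list(text.encode("utf-8"))
    merges = {}                                        # (int, int) -> int
    vocab = {idx: bytes([idx]) for idx in range(256)}  # int -> bytes
    for i in range(num_merges):
        stats = get_stats(ids)
        pair = max(stats, key=stats.get)  # most frequent adjacent pair
        idx = 256 + i                     # next unused token id
        ids = merge(ids, pair, idx)
        merges[pair] = idx
        vocab[idx] = vocab[pair[0]] + vocab[pair[1]]
    return merges, vocab
```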
### 3. Encoding and Decoding
- **Encoding:** To encode a new string, it is first converted to a list of byte IDs. Then the learned `merges` are applied in the same order they were learned: the algorithm repeatedly finds the pair in the sequence that was merged earliest during training (i.e., has the lowest merge index) and replaces it with the corresponding new token ID. This continues until no more merges can be applied.
- **Decoding:** Decoding is simpler; it is a lookup process. Each token ID in the sequence is looked up in the final `vocab` dictionary to retrieve its corresponding bytes. These bytes are then concatenated and decoded back into a UTF-8 string.
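Under the same simplified assumptions as the training sketch above (standalone functions reusing `get_stats` and `merge`, instead of `BasicTokenizer` methods), encoding and decoding can be sketched like this:

```python
def encode(text, merges):
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        # of the pairs present, pick the one merged earliest in training
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no pair in the sequence has a learned merge rule
        ids = merge(ids, pair, merges[pair])
    return ids

def decode(ids, vocab):
    # concatenate each token's bytes, then decode as UTF-8;
    # errors="replace" guards against ids that cut a multi-byte character
    text_bytes = b"".join(vocab[idx] for idx in ids)
    return text_bytes.decode("utf-8", errors="replace")
```

With `merges` and `vocab` produced by the training sketch, the round trip `decode(encode(s, merges), vocab) == s` holds for any string `s`, since every merge is invertible via the `vocab` lookup.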