minbpe: Minimal Byte Pair Encoding

Welcome to the documentation for minbpe, a minimal, educational implementation of the Byte Pair Encoding (BPE) algorithm for text tokenization. This project provides a clear and straightforward implementation of BPE, making it an excellent resource for learning how modern tokenizers work under the hood.

Project Philosophy

The goal of minbpe is to provide simple, readable code that demystifies the tokenization process used in Large Language Models. The primary implementation, BasicTokenizer, is designed to be algorithmically similar to the tokenizer used in GPT-2, but without the complexities of regular expression splitting or special token handling.

This makes it an ideal tool for developers and researchers who want to:

  • Understand the BPE algorithm from first principles.
  • Train custom tokenizers on small- to medium-sized datasets.
  • Experiment with the tokenization process in a controlled environment.

Core Features

  • BasicTokenizer: A pure, byte-level BPE implementation. It starts with the 256 raw bytes as the initial vocabulary and iteratively merges the most frequent pair of adjacent tokens; a sketch of this merge loop follows the list.
  • Training from Text: Easily train a new tokenizer on any text file. The train.py script provides a clear example of this process.
  • Serialization: Save your trained tokenizer's vocabulary and merge rules to disk and load them back for later use.
  • Encode/Decode: Convert strings to token IDs and back, the fundamental operations of a tokenizer.
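
The merge loop at the heart of BasicTokenizer can be summarized in a few lines. The sketch below is illustrative only: the helper names are not taken from minbpe's source (which is not shown here) and may differ from its internals, but the algorithm is the same byte-level BPE described above.

def get_stats(ids):
    # Count how often each adjacent pair of token ids occurs.
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # Replace every occurrence of `pair` in `ids` with the new token id `idx`.
    new_ids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

# Training: start from raw UTF-8 bytes and repeatedly merge the most frequent pair.
text = "aaabdaaabac"
ids = list(text.encode("utf-8"))   # initial vocabulary: the 256 byte values
num_merges = 3
merges = {}                        # (int, int) pair -> new token id

for i in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)   # most frequent adjacent pair
    idx = 256 + i                      # new token ids start after the 256 bytes
    ids = merge(ids, pair, idx)
    merges[pair] = idx
    print(f"merged {pair} into token {idx}")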

Example Usage

The following is a brief example of how to use minbpe to train a tokenizer and use it for encoding and decoding text. For a more detailed guide, see the Quick Start page.

import os
from minbpe.basic import BasicTokenizer

# Sample text for training
text = """
    A minimal (byte-level) Byte Pair Encoding tokenizer.
    Algorithmically follows along the GPT tokenizer.
"""

# 1. Initialize and train the tokenizer
vocab_size = 276 # 256 bytes + 20 merges
tokenizer = BasicTokenizer()
tokenizer.train(text, vocab_size)

# 2. Save the trained model (vocabulary and merge rules)
os.makedirs('models', exist_ok=True)
tokenizer.save('models/minbpe_basic')

# 3. Encode and decode text
encoded_text = tokenizer.encode("hello world")
decoded_text = tokenizer.decode(encoded_text)

print(f"Encoded: {encoded_text}")
print(f"Decoded: {decoded_text}")

This project also references other tokenizer implementations in its __init__.py and train.py files, such as RegexTokenizer, which likely adds regex-based pre-tokenization (splitting text into chunks before BPE is applied within each chunk). While their source is not included here, they point to natural pathways for extending the basic BPE algorithm; a hypothetical sketch of the idea follows.
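
Because RegexTokenizer's source is not shown here, the following is only a hypothetical sketch of what regex pre-tokenization usually looks like: the input is first split into chunks with a pattern (the GPT-2 split pattern is used below purely for illustration), and BPE merges are then learned and applied within each chunk, never across chunk boundaries. It relies on the third-party regex package, which supports \p{L}-style Unicode classes.

import regex as re  # third-party 'regex' module; the stdlib 're' lacks \p{L} classes

# GPT-2's split pattern, shown for illustration only.
GPT2_SPLIT_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

text = "Hello world! It's 2024."
chunks = re.findall(GPT2_SPLIT_PATTERN, text)
print(chunks)  # e.g. ['Hello', ' world', '!', ' It', "'s", ' 2024', '.']

# BPE would then operate on the UTF-8 bytes of each chunk independently,
# so merges never cross chunk boundaries.
byte_chunks = [list(ch.encode("utf-8")) for ch in chunks]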