Quick Start: Train and Use a Tokenizer

This guide will walk you through the entire process of training a new BasicTokenizer, saving it, and then using it to encode and decode text. We will use the sample text file tests/taylorswift.txt that comes with the project.

Step 1: Prepare Your Training Data

minbpe can be trained on any plain text file. The project already includes a sample file at tests/taylorswift.txt. We will use this file for our example.
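
Optionally, you can sanity check that the file is where the script expects it, and see how much text you are training on. This quick check is a convenience for this guide, not a required step:

import os

path = "tests/taylorswift.txt"
print(f"{path}: {os.path.getsize(path) / 1024:.0f} KiB of training text")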

Step 2: Write the Training Script

Create a new Python file named my_training.py and add the code below. The script performs four steps:

  1. Load the training data from the text file.
  2. Initialize a BasicTokenizer.
  3. Train the tokenizer to achieve a desired vocabulary size.
  4. Save the trained tokenizer's merge rules and vocabulary to a file.

import os
from minbpe.basic import BasicTokenizer

# Path to the training data
text_file_path = "tests/taylorswift.txt"

# --- 1. Load Data ---
print("Loading training data...")
with open(text_file_path, "r", encoding="utf-8") as f:
    text = f.read()

# --- 2. Initialize and Train Tokenizer ---
print("Training tokenizer...")
vocab_size = 512 # desired final vocabulary size: 256 byte tokens + 256 merges
tokenizer = BasicTokenizer()
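# Internally, train() performs vocab_size - 256 rounds of finding the most
# frequent adjacent token pair and merging it into a new token.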
tokenizer.train(text, vocab_size, verbose=False) # Set verbose=True to see merges

# --- 3. Save the Trained Tokenizer ---
model_dir = "models"
model_prefix = os.path.join(model_dir, "my_basic_tokenizer")

os.makedirs(model_dir, exist_ok=True)
tokenizer.save(model_prefix)

print(f"Tokenizer trained and saved to {model_prefix}.model and {model_prefix}.vocab")

Step 3: Run the Script

Execute the script from your terminal. It will read the text, train the tokenizer, and create a models directory containing my_basic_tokenizer.model and my_basic_tokenizer.vocab. Training runs in pure Python, so expect it to take on the order of tens of seconds for this corpus.

python my_training.py
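
To confirm that both files were written, you can list the output directory from a Python shell; the expected listing below assumes the default names used above:

import os
print(sorted(os.listdir("models")))
# expected: ['my_basic_tokenizer.model', 'my_basic_tokenizer.vocab']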

Step 4: Load and Use the Tokenizer

Now that you have a trained tokenizer, you can use it in any other script. Let's create a new file use_tokenizer.py to see how to load and use it.

from minbpe.basic import BasicTokenizer

# 1. Create a tokenizer instance and load the trained model.
#    Note: load() takes the path to the .model file; the companion .vocab
#    file is a pretty-printed version meant for human inspection only.
loaded_tokenizer = BasicTokenizer()
loaded_tokenizer.load("models/my_basic_tokenizer.model")

# 2. Define some text to encode
sample_text = "You Belong with Me"

# 3. Encode the text into token IDs
encoded_ids = loaded_tokenizer.encode(sample_text)

# 4. Decode the token IDs back into a string
decoded_text = loaded_tokenizer.decode(encoded_ids)
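
# 5. Optional sanity check: for valid UTF-8 input, BPE decoding is expected
#    to exactly invert encoding, so this assertion should pass.
assert decoded_text == sample_text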

# --- Display the results ---
print(f"Original text: '{sample_text}'")
print(f"Encoded token IDs: {encoded_ids}")
print(f"Decoded text: '{decoded_text}'")

Run this script:

python use_tokenizer.py

You will see the original text converted into a list of integers (token IDs) and then decoded back to exactly the original string. You have successfully trained and used your first custom BPE tokenizer!
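
As a next step, you can look at what each token ID stands for. The trained tokenizer keeps an ID-to-bytes vocabulary (minbpe exposes it as a vocab dict; treat the attribute name as an implementation detail), so appending a few lines to use_tokenizer.py shows how the text was split:

for token_id in encoded_ids:
    token_bytes = loaded_tokenizer.vocab[token_id]
    # individual tokens may split multi-byte characters, hence errors="replace"
    print(token_id, repr(token_bytes.decode("utf-8", errors="replace")))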