# Quick Start: Train and Use a Tokenizer
This guide walks you through the entire process of training a new `BasicTokenizer`, saving it, and then using it to encode and decode text. We will use the sample text file `tests/taylorswift.txt` that comes with the project.
## Step 1: Prepare Your Training Data
`minbpe` can be trained on any plain text file. The project already includes a sample file at `tests/taylorswift.txt`, which we will use for this example.
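Before training, it helps to see what the tokenizer starts from: `BasicTokenizer` operates on raw UTF-8 bytes, so any text begins as a sequence of integers in the range 0–255 and merges are learned on top of that. A stdlib-only sketch of this starting point (using an inline sample string as a stand-in for the actual training file):

```python
# A short stand-in for the training text (the real script reads
# tests/taylorswift.txt); BPE training sees only these raw bytes.
sample = "You Belong with Me"

raw_bytes = list(sample.encode("utf-8"))  # the initial, pre-merge token ids
print(raw_bytes[:8])                      # first few byte values
print(len(raw_bytes))                     # one token per byte before any merges
assert all(0 <= b < 256 for b in raw_bytes)  # base vocabulary is exactly 256 ids
```

Training then shrinks sequences like this by repeatedly merging frequent byte pairs into new token ids.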
## Step 2: Write the Training Script
Create a new Python file named `my_training.py` and add the following code. The script will:

- Load the training data from the text file.
- Initialize a `BasicTokenizer`.
- Train the tokenizer to a desired vocabulary size.
- Save the trained tokenizer's merge rules and vocabulary to disk.
```python
import os

from minbpe.basic import BasicTokenizer

# Path to the training data
text_file_path = "tests/taylorswift.txt"

# --- 1. Load Data ---
print("Loading training data...")
with open(text_file_path, "r", encoding="utf-8") as f:
    text = f.read()

# --- 2. Initialize and Train Tokenizer ---
print("Training tokenizer...")
vocab_size = 512  # desired final vocabulary size (256 raw bytes + 256 merges)
tokenizer = BasicTokenizer()
tokenizer.train(text, vocab_size, verbose=False)  # set verbose=True to see each merge

# --- 3. Save the Trained Tokenizer ---
model_dir = "models"
model_prefix = os.path.join(model_dir, "my_basic_tokenizer")
os.makedirs(model_dir, exist_ok=True)
tokenizer.save(model_prefix)
print(f"Tokenizer trained and saved to {model_prefix}.model and {model_prefix}.vocab")
```
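What `train` does internally is iterative: it repeatedly finds the most frequent adjacent pair of token ids and replaces every occurrence with a new id. The following is not the minbpe implementation, just a minimal self-contained sketch of one such merge step:

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent id pairs and return the most common one."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("abababcd".encode("utf-8"))  # [97, 98, 97, 98, 97, 98, 99, 100]
pair = most_frequent_pair(ids)          # (97, 98), i.e. the bytes of "ab"
ids = merge(ids, pair, 256)             # 256 is the first id beyond the raw bytes
print(pair, ids)                        # (97, 98) [256, 256, 256, 99, 100]
```

A full training run simply repeats this until the vocabulary reaches `vocab_size`, recording each `(pair, new_id)` rule so `encode` can replay the merges and `decode` can undo them.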
## Step 3: Run the Script
Execute the script from your terminal. It will read the text, train the tokenizer, and create a `models` directory containing `my_basic_tokenizer.model` and `my_basic_tokenizer.vocab`. The `.model` file is what you load later; the `.vocab` file is a human-readable listing for inspection.
```bash
python my_training.py
```
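Based on the script's own `print` statements, the run should produce output along these lines (path shown as it appears on Linux/macOS; training a 512-token vocabulary on this file may take a minute or two):

```text
Loading training data...
Training tokenizer...
Tokenizer trained and saved to models/my_basic_tokenizer.model and models/my_basic_tokenizer.vocab
```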
## Step 4: Load and Use the Tokenizer
Now that you have a trained tokenizer, you can use it in any other script. Create a new file `use_tokenizer.py` to load the model and run an encode/decode round trip:
```python
from minbpe.basic import BasicTokenizer

# 1. Create a tokenizer instance and load the trained model
loaded_tokenizer = BasicTokenizer()
loaded_tokenizer.load("models/my_basic_tokenizer.model")

# 2. Define some text to encode
sample_text = "You Belong with Me"

# 3. Encode the text into token IDs
encoded_ids = loaded_tokenizer.encode(sample_text)

# 4. Decode the token IDs back into a string
decoded_text = loaded_tokenizer.decode(encoded_ids)

# --- Display the results ---
print(f"Original text: '{sample_text}'")
print(f"Encoded token IDs: {encoded_ids}")
print(f"Decoded text: '{decoded_text}'")
```
Run this script:

```bash
python use_tokenizer.py
```
You should see the original text converted into a list of integers (token IDs) and then decoded back to exactly the original string. You have successfully trained and used your first custom BPE tokenizer!
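The round trip is exact because a byte-level BPE vocabulary covers all 256 byte values: any valid UTF-8 string, including accented characters or emoji, survives encode/decode without loss. A stdlib-only illustration of the underlying byte-level property (not using minbpe itself; a trained tokenizer just groups these bytes into larger units before this same final decode):

```python
# Byte-level round trip: UTF-8 bytes always reconstruct the original string.
for s in ["You Belong with Me", "café", "日本語", "🎸"]:
    ids = list(s.encode("utf-8"))          # byte-level "token ids"
    restored = bytes(ids).decode("utf-8")  # reassemble bytes, decode as UTF-8
    assert restored == s                   # lossless for any valid UTF-8 text
print("all round trips exact")
```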