# API Reference

This page provides a detailed reference for the classes and methods in the `minbpe` library. Note that this documentation is based on the provided source code for `minbpe/basic.py`. Other modules like `minbpe.base` are used, but their source is not provided in this context.
## minbpe.basic.BasicTokenizer

This is a minimal, byte-level Byte Pair Encoding (BPE) tokenizer. It inherits from a `Tokenizer` base class (defined in `minbpe.base`).

```python
class BasicTokenizer(Tokenizer):
    # ... methods ...
```
### __init__(self)

Initializes a new instance of the `BasicTokenizer`.
### train(self, text, vocab_size, verbose=False)

Trains the tokenizer on a given text to create a vocabulary of a specific size.

- Parameters:
  - `text` (str): The raw text data to train on.
  - `vocab_size` (int): The target size of the vocabulary. Must be >= 256.
  - `verbose` (bool, optional): If `True`, prints progress information for each merge. Defaults to `False`.
After training, the tokenizer instance will have two important attributes populated:

- `self.merges` (dict): A dictionary mapping merged pairs of token IDs (e.g., `(101, 32)`) to their new token ID (e.g., `256`). This is used during encoding.
- `self.vocab` (dict): A dictionary mapping all token IDs (from 0 to `vocab_size - 1`) to their corresponding byte sequences (e.g., `256: b'th'`). This is used during decoding.
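To make the training procedure concrete, here is a minimal, self-contained sketch of a BPE training loop of the kind `train` performs. The helper names `get_stats`, `merge`, and `train_sketch` are illustrative, not necessarily the library's exact API:

```python
def get_stats(ids):
    """Count how often each adjacent pair of token IDs occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    """Replace every occurrence of `pair` in `ids` with the new token `idx`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_sketch(text, vocab_size):
    """Illustrative BPE training: learn vocab_size - 256 merges."""
    ids = list(text.encode("utf-8"))          # start from raw UTF-8 bytes
    merges = {}
    vocab = {i: bytes([i]) for i in range(256)}
    for i in range(vocab_size - 256):
        stats = get_stats(ids)
        if not stats:
            break
        pair = max(stats, key=stats.get)       # most frequent adjacent pair
        idx = 256 + i                          # next unused token ID
        ids = merge(ids, pair, idx)
        merges[pair] = idx
        vocab[idx] = vocab[pair[0]] + vocab[pair[1]]
    return merges, vocab
```

For example, training on `"aaabdaaabac"` with `vocab_size=258` first merges the most frequent pair `(97, 97)` (i.e., `b'aa'`) into token 256, then `(256, 97)` into token 257 (`b'aaa'`).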
### encode(self, text)

Encodes a given string into a list of token IDs.

- Parameters:
  - `text` (str): The input string to tokenize.
- Returns:
  - `list[int]`: A list of integer token IDs.
Process:

1. The input string is converted into a list of integers (0-255) representing its UTF-8 bytes.
2. The method repeatedly searches for the pair of adjacent tokens that has the lowest merge index (i.e., was learned earliest during training).
3. It replaces this pair with the new token ID from `self.merges`.
4. This continues until no more merges can be performed.
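The steps above can be sketched as a short, self-contained loop. The helper names here are illustrative, not necessarily the library's exact internals:

```python
def get_stats(ids):
    """Count adjacent pairs in the current token sequence."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    """Replace every occurrence of `pair` with the new token `idx`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def encode_sketch(text, merges):
    """Illustrative encode: always apply the earliest-learned merge first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        # the pair with the lowest merge index was learned earliest
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no remaining pair can be merged
        ids = merge(ids, pair, merges[pair])
    return ids
```

Applying merges in learned order (rather than by current frequency) is what makes encoding deterministic and consistent with training.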
### decode(self, ids)

Decodes a list of token IDs back into a string.

- Parameters:
  - `ids` (list[int]): A list of token IDs.
- Returns:
  - `str`: The decoded Python string.
Process:

1. Each token ID in the list is looked up in `self.vocab` to get its byte sequence.
2. All byte sequences are concatenated together.
3. The final byte string is decoded as UTF-8; invalid byte sequences are replaced with the Unicode replacement character (i.e., `errors="replace"`).
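Decoding is short enough to sketch in full. This is an illustrative stand-in, not the library's exact code:

```python
def decode_sketch(ids, vocab):
    """Illustrative decode: look up bytes, concatenate, then UTF-8 decode."""
    raw = b"".join(vocab[i] for i in ids)          # token IDs -> byte string
    return raw.decode("utf-8", errors="replace")   # invalid bytes -> U+FFFD
```

The `errors="replace"` policy matters because a token sequence may split a multi-byte UTF-8 character, so not every list of IDs decodes to valid UTF-8.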
### save(self, prefix)

Saves the tokenizer's state to files. This method is inherited from the `Tokenizer` base class. It saves two files:

- `{prefix}.model`: Contains the merge rules.
- `{prefix}.vocab`: Contains the vocabulary.

- Parameters:
  - `prefix` (str): The file path and prefix for the saved files.
### load(self, model_file)

Loads the tokenizer's state from a `.model` file. This method is inherited from the `Tokenizer` base class.

- Parameters:
  - `model_file` (str): The path to the `.model` file.
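The actual `.model` file format is defined in `minbpe.base`, whose source is not shown here. As a simplified, hypothetical illustration of the core idea, merges can be persisted in training order, since each new token ID is implied by its position:

```python
def save_model_sketch(merges, path):
    """Hypothetical format: one 'p0 p1' pair per line, in training order."""
    with open(path, "w") as f:
        for (p0, p1) in merges:  # dicts preserve insertion (training) order
            f.write(f"{p0} {p1}\n")

def load_model_sketch(path):
    """Rebuild the merges dict; the i-th line maps to token ID 256 + i."""
    merges = {}
    with open(path) as f:
        for i, line in enumerate(f):
            p0, p1 = map(int, line.split())
            merges[(p0, p1)] = 256 + i
    return merges
```

Note that only the merges need to be saved: the vocabulary can be reconstructed from them, which is why `load` takes just the `.model` file.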