API Reference

This page provides a detailed reference for the classes and methods in the minbpe library. This documentation is based on the provided source code for minbpe/basic.py; other modules, such as minbpe.base, are used but their source is not included here.

minbpe.basic.BasicTokenizer

This is a minimal, byte-level Byte Pair Encoding tokenizer. It inherits from a Tokenizer base class (defined in minbpe.base).

class BasicTokenizer(Tokenizer):
    # ... methods ...

__init__(self)

Initializes a new instance of the BasicTokenizer.

train(self, text, vocab_size, verbose=False)

Trains the tokenizer on a given text to create a vocabulary of a specific size.

  • Parameters:
    • text (str): The raw text data to train on.
    • vocab_size (int): The target size of the vocabulary. Must be >= 256.
    • verbose (bool, optional): If True, prints progress information for each merge. Defaults to False.

After training, the tokenizer instance will have two important attributes populated:

  • self.merges (dict): A dictionary mapping merged pairs of token IDs (e.g., (101, 32)) to their new token ID (e.g., 256). This is used during encoding.
  • self.vocab (dict): A dictionary mapping all token IDs (from 0 to vocab_size - 1) to their corresponding byte sequences (e.g., 256: b'th'). This is used during decoding.
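The training procedure that produces these two attributes can be sketched as follows. This is a minimal, self-contained illustration of the byte-level BPE loop described above, not the library's actual code; the function name `train_sketch` is hypothetical.

```python
# Minimal sketch of byte-level BPE training (illustrative, not minbpe's code):
# repeatedly merge the most frequent adjacent pair until vocab_size is reached.
from collections import Counter

def train_sketch(text, vocab_size):
    assert vocab_size >= 256
    ids = list(text.encode("utf-8"))            # raw bytes as ints 0..255
    merges = {}                                 # (id, id) -> new merged id
    vocab = {i: bytes([i]) for i in range(256)} # base vocabulary: single bytes
    for new_id in range(256, vocab_size):
        pairs = Counter(zip(ids, ids[1:]))      # count adjacent pairs
        if not pairs:
            break                               # nothing left to merge
        pair = max(pairs, key=pairs.get)        # most frequent pair
        merges[pair] = new_id
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
        # replace every occurrence of `pair` with the new token id
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return merges, vocab
```

For example, training on "aaabdaaabac" with vocab_size=258 first merges the most frequent pair (97, 97) (the bytes for "aa") into token 256.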

encode(self, text)

Encodes a given string into a list of token IDs.

  • Parameters:
    • text (str): The input string to tokenize.
  • Returns:
    • list[int]: A list of integer token IDs.

Process:

  1. The input string is converted into a list of integers (0-255) representing its UTF-8 bytes.
  2. The method repeatedly searches for the pair of adjacent tokens that has the lowest merge index (i.e., was learned earliest during training).
  3. It replaces this pair with the new token ID from self.merges.
  4. This continues until no more merges can be performed.
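The four steps above can be sketched as a short function. This is an illustrative implementation of the described behavior, assuming a `merges` dict of the shape produced by train; `encode_sketch` is a hypothetical name.

```python
# Sketch of the encoding loop described above (illustrative, not minbpe's code).
def encode_sketch(text, merges):
    ids = list(text.encode("utf-8"))            # step 1: UTF-8 bytes as ints
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # step 2: pick the adjacent pair learned earliest during training
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break                               # step 4: no merges apply
        new_id = merges[pair]
        # step 3: replace every occurrence of `pair` with its merged id
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids
```

Choosing the pair with the lowest merge index (rather than, say, the most frequent pair in the input) matters: it replays the merges in the exact order they were learned, so encoding is deterministic and consistent with training.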

decode(self, ids)

Decodes a list of token IDs back into a string.

  • Parameters:
    • ids (list[int]): A list of token IDs.
  • Returns:
    • str: The decoded Python string.

Process:

  1. Each token ID in the list is looked up in self.vocab to get its byte sequence.
  2. All byte sequences are concatenated together.
  3. The concatenated byte string is decoded as UTF-8; invalid byte sequences are replaced with the Unicode replacement character (U+FFFD) rather than raising an error.
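Decoding is much simpler than encoding, as the three steps above suggest. A minimal sketch, assuming a `vocab` dict of the shape produced by train (`decode_sketch` is a hypothetical name):

```python
# Sketch of decoding (illustrative, not minbpe's code): look up each id's
# bytes, concatenate, then decode as UTF-8 with errors replaced.
def decode_sketch(ids, vocab):
    raw = b"".join(vocab[i] for i in ids)       # steps 1 and 2
    return raw.decode("utf-8", errors="replace")  # step 3
```

The `errors="replace"` argument is what makes decoding total: a token sequence that splits a multi-byte UTF-8 character (or an arbitrary invalid byte) still decodes, yielding U+FFFD in place of the malformed bytes.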

save(self, prefix)

Saves the tokenizer's state to files. This method is inherited from the Tokenizer base class. It saves two files:

  • {prefix}.model: Contains the merge rules.
  • {prefix}.vocab: Contains the vocabulary.

  • Parameters:
    • prefix (str): The file path and prefix for the saved files.

load(self, model_file)

Loads the tokenizer's state from a .model file. This method is inherited from the Tokenizer base class.

  • Parameters:
    • model_file (str): The path to the .model file.