Installation Guide

minbpe is a small Python library with two third-party dependencies. The recommended way to install it is to clone the repository and then install the required packages with pip.

Prerequisites

  • Python 3.6 or higher (recent tiktoken releases may require a newer Python version)
  • git for cloning the repository
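
The two prerequisites above can be checked from Python before cloning. This is a minimal sketch using only the standard library; the helper name check_prerequisites is hypothetical and not part of minbpe, and the minimum version is simply the one stated in the list above.

```python
import shutil
import sys

# Minimum version taken from the prerequisites list above.
MIN_PYTHON = (3, 6)


def check_prerequisites(min_python=MIN_PYTHON):
    """Return a dict describing whether each prerequisite is met."""
    return {
        # Compare only (major, minor) of the running interpreter.
        "python": sys.version_info[:2] >= tuple(min_python),
        # shutil.which returns None when the executable is not on PATH.
        "git": shutil.which("git") is not None,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

If either entry is reported as missing, install or upgrade it before continuing with the steps below.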

Steps

  1. Clone the Repository

    Open your terminal and clone the minbpe repository from GitHub:

    git clone https://github.com/karpathy/minbpe.git

  2. Navigate to the Project Directory

    Change into the newly created directory:

    cd minbpe

  3. Install Dependencies

    The project's dependencies are listed in the requirements.txt file. Install them using pip:

    pip install -r requirements.txt

    This will install the following packages:

    • regex: A third-party regular expression library. The RegexTokenizer used in train.py relies on it for text-splitting patterns that need features beyond the standard re module, such as Unicode property classes like \p{L}.
    • tiktoken: OpenAI's fast BPE tokenizer library. The GPT4Tokenizer (exported from minbpe/__init__.py) uses it to load pre-trained tokenizers such as GPT-4's.
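
To confirm that pip installed both packages into the interpreter you intend to use, a short check like the following may help. It is a sketch that relies only on the standard library's importlib.util.find_spec; the helper name missing_packages is hypothetical, and the package names are the two listed above.

```python
import importlib.util

# Package names taken from the list above.
REQUIRED = ["regex", "tiktoken"]


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If a package is reported as missing, make sure the pip you ran belongs to the same Python interpreter that runs this check.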

Once the dependencies are installed, you are ready to use the library and run the example training script. To verify the installation, run the script from the project directory:

python train.py

Note that the provided train.py script attempts to use both BasicTokenizer and RegexTokenizer. If your copy of the source includes only BasicTokenizer, you may need to edit the script to remove RegexTokenizer before it will run successfully.
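
One way to adapt the script is to train only the tokenizer classes that are actually present. The sketch below shows the idea with a stand-in object rather than the real package, so it runs without minbpe installed; the helper name available_tokenizers is hypothetical, and how the classes are then trained depends on what your copy of train.py does.

```python
import types


def available_tokenizers(module, names=("BasicTokenizer", "RegexTokenizer")):
    """Return {name: class} for each tokenizer name that `module` defines."""
    found = {}
    for name in names:
        cls = getattr(module, name, None)
        if cls is not None:
            found[name] = cls
    return found


# Stand-in for a minbpe checkout that only ships BasicTokenizer.
fake_minbpe = types.SimpleNamespace(BasicTokenizer=object)
print(sorted(available_tokenizers(fake_minbpe)))  # prints ['BasicTokenizer']
```

In an adapted train.py you would pass the imported minbpe module instead of the stand-in and loop over the returned classes, skipping any tokenizer that is not available.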