Installation Guide
minbpe is a standalone Python library with a few dependencies. The recommended way to install it is by cloning the repository and then installing the required packages using pip.
Prerequisites
- Python 3.6 or higher
- git for cloning the repository
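Both prerequisites can be checked before starting. The following is a minimal sketch using only the standard library; the helper name `check_prerequisites` is hypothetical and is not part of minbpe.

```python
import shutil
import sys

# Minimum Python version stated in the prerequisites above.
MIN_PYTHON = (3, 6)


def check_prerequisites():
    """Return a dict mapping each prerequisite to True if it is satisfied."""
    return {
        # Compare (major, minor) of the running interpreter to the minimum.
        "python": sys.version_info[:2] >= MIN_PYTHON,
        # shutil.which returns None when no `git` executable is on PATH.
        "git": shutil.which("git") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

If `git` is reported as missing, install it through your operating system's package manager before continuing.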
Steps
- Clone the Repository
Open your terminal and clone the minbpe repository from GitHub:
git clone https://github.com/karpathy/minbpe.git
- Navigate to the Project Directory
Change into the newly created directory:
cd minbpe
- Install Dependencies
The project's dependencies are listed in the requirements.txt file. Install them using pip:
pip install -r requirements.txt
This will install the following packages:
- regex: A third-party regular expression library. This is likely used by the RegexTokenizer mentioned in train.py for more advanced text splitting patterns.
- tiktoken: OpenAI's fast BPE tokenizer library. This is likely used by the GPT4Tokenizer (referenced in minbpe/__init__.py) to load and use pre-trained tokenizers like GPT-4's.
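To confirm that pip installed both packages into the active environment, you can check that they are importable. This is a small sketch, not part of the repository; `missing_packages` is a hypothetical helper name.

```python
import importlib.util

# Package names taken from the list above.
REQUIRED = ("regex", "tiktoken")


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found for import."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

An empty result means both imports should succeed. A non-empty result usually indicates that pip installed into a different Python environment than the one running the check.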
Once the dependencies are installed, you are ready to use the library and run the training scripts. You can verify the installation by running the example training script:
python train.py
Note that the provided train.py script attempts to use both BasicTokenizer and RegexTokenizer. If you only have the source code for BasicTokenizer, you may need to modify the script to run successfully.
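Before editing train.py, it can help to see which tokenizer classes your copy of the package actually exposes. The sketch below assumes it is run from the repository root (so that `minbpe` is importable) and that the classes are exported at package level; `available_tokenizers` is a hypothetical helper, not something the repository provides.

```python
import importlib


def available_tokenizers(names=("BasicTokenizer", "RegexTokenizer")):
    """Return the class names from `names` that the minbpe package exposes."""
    try:
        package = importlib.import_module("minbpe")
    except ImportError:
        # Either minbpe is not on the import path, or its __init__.py
        # imports a module that is missing from this copy of the source.
        return []
    return [name for name in names if hasattr(package, name)]


if __name__ == "__main__":
    print("Tokenizers found:", available_tokenizers())
```

If RegexTokenizer is absent from the result, remove it from the list of tokenizers that train.py iterates over. An empty result when run from the repository root suggests that minbpe/__init__.py itself fails to import and needs the same adjustment.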