Installation Guide
minbpe is a standalone Python library with a few dependencies. The recommended way to install it is by cloning the repository and then installing the required packages using pip.
Prerequisites
- Python 3.6 or higher
- git, for cloning the repository
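
The prerequisites above can be checked before starting. The following is a minimal sketch using only the standard library; the function name check_prerequisites and the constant MIN_PYTHON are illustrative and not part of minbpe.

```python
import shutil
import sys

# Minimum Python version stated in the prerequisites above.
MIN_PYTHON = (3, 6)


def check_prerequisites(min_python=MIN_PYTHON):
    """Return a dict reporting whether each prerequisite is satisfied."""
    return {
        # Compare (major, minor) of the running interpreter to the minimum.
        "python": sys.version_info[:2] >= tuple(min_python),
        # shutil.which returns None when no git executable is on PATH.
        "git": shutil.which("git") is not None,
    }


if __name__ == "__main__":
    # str.format is used instead of f-strings so the script itself
    # still parses on interpreters older than 3.6.
    for name, ok in check_prerequisites().items():
        print("{}: {}".format(name, "ok" if ok else "missing or too old"))
```

Running the script prints one line per prerequisite, which makes it easy to spot a missing git executable before attempting the clone step.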
Steps
- Clone the Repository
  Open your terminal and clone the minbpe repository from GitHub:
  git clone https://github.com/karpathy/minbpe.git
- Navigate to the Project Directory
  Change into the newly created directory:
  cd minbpe
- Install Dependencies
  The project's dependencies are listed in the requirements.txt file. Install them using pip:
  pip install -r requirements.txt
  This will install the following packages:
  - regex: A third-party regular expression library. This is likely used by the RegexTokenizer mentioned in train.py for more advanced text splitting patterns.
  - tiktoken: OpenAI's fast BPE tokenizer library. This is likely used by the GPT4Tokenizer (referenced in minbpe/__init__.py) to load and use pre-trained tokenizers like GPT-4's.
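
After the pip step, a quick way to confirm that both packages are visible to the interpreter is to ask the import system for them. This is a sketch, not part of the repository; the helper name missing_packages is hypothetical, and it only checks that the packages can be located, not that they are a particular version.

```python
import importlib.util

# Import names of the two packages listed above.
REQUIRED = ("regex", "tiktoken")


def missing_packages(names=REQUIRED):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages are installed.")
```

If any package is reported missing, re-running the pip command from inside the activated environment is usually the first thing to try.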
Once the dependencies are installed, you are ready to use the library and run the training scripts. You can verify the installation by running the example training script:
python train.py
Note that the provided train.py script attempts to use both BasicTokenizer and RegexTokenizer. If you only have the source code for BasicTokenizer, you may need to modify the script to run successfully.
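
One possible way to make such a modification is to look up the tokenizer classes defensively and use only those that are found. The sketch below assumes the classes are exposed at the top level of the minbpe package; the helper names optional_class and pick_tokenizers are hypothetical. If the package's own __init__.py imports a module that is missing, the whole import fails and the result is empty, in which case that import would also need to be edited.

```python
import importlib


def optional_class(module_name, class_name):
    """Return the named class, or None if the module or class is unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, class_name, None)


def pick_tokenizers(module_name="minbpe",
                    class_names=("BasicTokenizer", "RegexTokenizer")):
    """Return {class name: class} for every tokenizer that can be imported."""
    found = {}
    for name in class_names:
        cls = optional_class(module_name, name)
        if cls is not None:
            found[name] = cls
    return found


if __name__ == "__main__":
    classes = pick_tokenizers()
    print("Tokenizers available:", sorted(classes) or "none")
```

A modified training script could then loop over the returned classes instead of naming both tokenizers directly.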