🏋️ Training New Voices
Piper includes a complete toolchain for training new VITS models or fine-tuning existing ones. The training code is located in src/piper/train and is managed with PyTorch Lightning and its LightningCLI.
1. Prerequisites
Before you begin, ensure you have the necessary system packages installed.
On Debian-based systems (like Ubuntu), you can install them with:
sudo apt-get update
sudo apt-get install build-essential cmake ninja-build
2. Setup
First, clone the repository and set up a Python virtual environment:
git clone https://github.com/OHF-voice/piper1-gpl.git
cd piper1-gpl
python3 -m venv .venv
source .venv/bin/activate
Next, install the training dependencies:
pip install -e '.[train]'
Then, build the required Cython extension for monotonic alignment:
./build_monotonic_align.sh
Finally, perform a development build of the main C extension:
python3 setup.py build_ext --inplace
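As an optional smoke test (assuming the steps above completed inside the active virtual environment), you can confirm that the package and the training module import cleanly:

import piper
import piper.train

print("piper imported from:", piper.__file__)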
3. Dataset Format
To train a model, you need a dataset consisting of audio files and a corresponding metadata file. The metadata file must be a CSV file that uses | as the delimiter, with the following format:
utt1.wav|Text for utterance 1.
utt2.wav|Text for utterance 2.
...
- Column 1: The filename of the audio clip (any format supported by librosa). These files must be located in the directory specified by --data.audio_dir.
- Column 2: The transcribed text for the audio clip. This text will be phonemized by espeak-ng. A quick sanity check for this layout is sketched just below.
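Before committing to a long training run, it can help to validate the metadata against the audio directory. Below is a minimal, stand-alone sketch (not part of Piper; the paths are placeholders matching the example command in the next section) that flags malformed rows and missing audio files:

import csv
from pathlib import Path

CSV_PATH = Path("/path/to/your/metadata.csv")  # placeholder paths; adjust to your dataset
AUDIO_DIR = Path("/path/to/your/audio/")

missing = []
with open(CSV_PATH, encoding="utf-8") as f:
    for line_num, row in enumerate(csv.reader(f, delimiter="|"), start=1):
        if len(row) < 2 or not row[0].strip() or not row[1].strip():
            print(f"Row {line_num}: expected 'filename|text', got: {row}")
            continue
        if not (AUDIO_DIR / row[0]).exists():
            missing.append(row[0])

print(f"{len(missing)} referenced audio file(s) not found in {AUDIO_DIR}")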
4. Running the Training Script
You can start training by running the piper.train module. It is highly recommended to fine-tune from a pre-trained checkpoint, as this significantly speeds up training and improves quality.
Pre-trained checkpoints are available at huggingface.co/datasets/rhasspy/piper-checkpoints.
Here is an example command to start training:
python3 -m piper.train fit \
--data.voice_name "my-custom-voice" \
--data.csv_path /path/to/your/metadata.csv \
--data.audio_dir /path/to/your/audio/ \
--model.sample_rate 22050 \
--data.espeak_voice "en-us" \
--data.cache_dir /path/to/training_cache/ \
--data.config_path /path/to/write/config.json \
--data.batch_size 32 \
--ckpt_path /path/to/downloaded/finetune.ckpt
Key Arguments:
- --data.voice_name: A unique name for your voice.
- --data.csv_path: Path to your metadata CSV file.
- --data.audio_dir: Directory containing your audio files.
- --model.sample_rate: The sample rate of your audio (usually 22050 Hz for Piper models).
- --data.espeak_voice: The espeak-ng voice to use for phonemization (e.g., en-us, fr-fr). Find available voices with espeak-ng --voices.
- --data.cache_dir: A directory used to cache processed artifacts such as phonemes and trimmed audio.
- --data.config_path: The path where the final voice configuration JSON file will be written.
- --data.batch_size: The training batch size. Adjust based on your GPU's VRAM.
- --ckpt_path: Path to a pre-trained Piper checkpoint for fine-tuning.
For a full list of options, run python3 -m piper.train fit --help.
5. Exporting the Model
Once training is complete, the model will be saved as a PyTorch Lightning checkpoint (.ckpt). You need to export it to the ONNX format for use with Piper:
python3 -m piper.train.export_onnx \
--checkpoint /path/to/your/checkpoint.ckpt \
--output-file /path/to/your/model.onnx
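As a quick sanity check, you can confirm that the exported file loads and inspect its input signature with onnxruntime (this only verifies that the graph loads; it does not run synthesis):

import onnxruntime

# CPU is enough for a load check, even if training ran on GPU.
session = onnxruntime.InferenceSession(
    "/path/to/your/model.onnx",
    providers=["CPUExecutionProvider"],
)
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)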
To use the exported model with Piper, you need two files:
- The ONNX model file (e.g., en_US-my-voice-medium.onnx).
- The JSON config file generated during training (written to --data.config_path), which should be renamed to match the model (e.g., en_US-my-voice-medium.onnx.json).
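With both files next to each other, the voice can then be loaded like any other Piper voice. The following is a rough illustration only (the PiperVoice API shown here follows the project's main README and may differ between releases; treat the method names as assumptions and check the README for your installed version):

import wave

from piper import PiperVoice

# PiperVoice.load() expects the matching .onnx.json config next to the model file.
voice = PiperVoice.load("/path/to/en_US-my-voice-medium.onnx")

with wave.open("test.wav", "wb") as wav_file:
    # Method name as in the project's README; it may differ in other releases.
    voice.synthesize_wav("Hello from my custom voice!", wav_file)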