🏋️ Training New Voices

Piper includes a complete toolchain for training new VITS models or fine-tuning existing ones. The training code lives in src/piper/train and is built on PyTorch Lightning, with LightningCLI providing the command-line interface.

1. Prerequisites

Before you begin, ensure you have the necessary system packages installed.

On Debian-based systems (like Ubuntu), you can install them with:

sudo apt-get update
sudo apt-get install build-essential cmake ninja-build
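
If you want to confirm the build tools are available before continuing (build-essential provides gcc, and the other two packages provide the matching binaries), a quick optional check:

gcc --version
cmake --version
ninja --version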

2. Setup

First, clone the repository and set up a Python virtual environment:

git clone https://github.com/OHF-voice/piper1-gpl.git
cd piper1-gpl
python3 -m venv .venv
source .venv/bin/activate

Next, install the training dependencies:

pip install -e '.[train]'

Then, build the required Cython extension for monotonic alignment:

./build_monotonic_align.sh

Finally, perform a development build of the main C extension:

python3 setup.py build_ext --inplace
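
As an optional sanity check that the training extras installed and the package layout matches src/piper/train, the training module should import cleanly from inside the virtual environment:

python3 -c "import piper.train"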

3. Dataset Format

To train a model, you need a dataset consisting of audio files and a corresponding metadata file. The metadata file must be a CSV file using | as the delimiter, with the following format:

utt1.wav|Text for utterance 1.
utt2.wav|Text for utterance 2.
...
  • Column 1: The filename of the audio clip (any format supported by librosa). These files must be located in the directory specified by --data.audio_dir.
  • Column 2: The transcribed text for the audio clip. This text will be phonemized by espeak-ng.
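
Before launching a long training run, it can be worth confirming that every file named in the first column actually exists in the audio directory. A minimal shell sketch (both paths below are placeholders for your own):

# Report any audio file referenced in the metadata CSV that is missing from the audio directory.
while IFS='|' read -r wav text; do
  [ -f "/path/to/your/audio/$wav" ] || echo "missing: $wav"
done < /path/to/your/metadata.csv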

4. Running the Training Script

You can start training by running the piper.train module. It is highly recommended to fine-tune from a pre-trained checkpoint, as this significantly speeds up training and improves quality.

Pre-trained checkpoints are available at huggingface.co/datasets/rhasspy/piper-checkpoints.
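
One way to fetch a checkpoint is with the huggingface_hub CLI (pip install huggingface_hub if it is not already available). The file path below is a placeholder; browse the dataset first to find the checkpoint that matches your language and quality level:

huggingface-cli download rhasspy/piper-checkpoints \
  path/to/epoch=XXXX-step=YYYYYY.ckpt \
  --repo-type dataset --local-dir ./checkpoints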

Here is an example command to start training:

python3 -m piper.train fit \
  --data.voice_name "my-custom-voice" \
  --data.csv_path /path/to/your/metadata.csv \
  --data.audio_dir /path/to/your/audio/ \
  --model.sample_rate 22050 \
  --data.espeak_voice "en-us" \
  --data.cache_dir /path/to/training_cache/ \
  --data.config_path /path/to/write/config.json \
  --data.batch_size 32 \
  --ckpt_path /path/to/downloaded/finetune.ckpt

Key Arguments:

  • --data.voice_name: A unique name for your voice.
  • --data.csv_path: Path to your metadata CSV file.
  • --data.audio_dir: Directory containing your audio files.
  • --model.sample_rate: The sample rate of your audio (usually 22050 Hz for Piper models).
  • --data.espeak_voice: The espeak-ng voice to use for phonemization (e.g., en-us, fr-fr). Find available voices with espeak-ng --voices (see the example after this list).
  • --data.cache_dir: A directory to cache processed artifacts like phonemes and trimmed audio.
  • --data.config_path: The path where the final voice configuration JSON file will be saved.
  • --data.batch_size: The training batch size. Adjust based on your GPU's VRAM.
  • --ckpt_path: Path to a pre-trained Piper checkpoint for fine-tuning.
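
To confirm that the voice you plan to pass via --data.espeak_voice exists and phonemizes your language sensibly, you can query espeak-ng directly (en-us and the sample sentence here are just examples):

# List available voices for a language, then preview the IPA phonemes for a sample sentence without playing audio.
espeak-ng --voices=en
espeak-ng -v en-us --ipa -q "Text for utterance 1."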

For a full list of options, run python3 -m piper.train fit --help.
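
PyTorch Lightning writes TensorBoard-compatible logs by default (typically under lightning_logs/ in the working directory, though this depends on your logger configuration), so you can monitor losses while training runs:

tensorboard --logdir lightning_logs/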

5. Exporting the Model

Once training is complete, the model will be saved as a PyTorch Lightning checkpoint (.ckpt). You need to export it to the ONNX format for use with Piper:

python3 -m piper.train.export_onnx \
  --checkpoint /path/to/your/checkpoint.ckpt \
  --output-file /path/to/your/model.onnx

To use the exported model with Piper, you need two files:

  1. The ONNX model file (e.g., en_US-my-voice-medium.onnx).
  2. The JSON config file generated during training (--data.config_path), which should be renamed to match the model (e.g., en_US-my-voice-medium.onnx.json).
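
With both files in place, you can synthesize a quick test phrase with the Piper CLI. A hedged sketch, assuming your version of the CLI accepts a path to a local .onnx file via -m/--model and an output file via -f (check python3 -m piper --help for the exact flags):

python3 -m piper \
  -m /path/to/en_US-my-voice-medium.onnx \
  -f test.wav \
  -- 'This is a test of my custom voice.'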