Audio Alignments

Piper includes experimental support for audio alignments, which expose the duration (in audio samples) of each phoneme ID used during synthesis. This feature is useful for applications like synchronizing synthesized speech with mouth movements (visemes) in animations.

Patching Voices for Alignments

To enable alignment data, you must first "patch" the voice's ONNX model file. This requires the onnx Python package.

  1. Install the onnx package:

    pip install onnx

  2. Run the patching script:

    python3 -m piper.patch_voice_with_alignment /path/to/your/model.onnx

This script modifies the ONNX model to output alignment information. Once patched, the onnx package is no longer required for synthesis. Patched models should remain compatible with older Piper installations, which will simply ignore the extra output.

Python API for Alignments

When a patched voice model is used, the AudioChunk class in the Python API exposes several additional fields:

  • phonemes (list of str): The phonemes used to produce the audio chunk.
  • phoneme_ids (list of int): The phoneme IDs used by the model.
  • phoneme_id_samples (list of int): The number of audio samples corresponding to each phoneme ID.
  • phoneme_alignments (list of tuples): A list of (phoneme, sample_count) tuples.

These fields are None if the voice model does not support alignments, or if alignments are disabled in the SynthesisConfig (include_alignments=False).

Example usage:

from piper import PiperVoice

# Load a patched voice
voice = PiperVoice.load("./path/to/patched_model.onnx")

for chunk in voice.synthesize("Hello world."):
    if chunk.phoneme_alignments:
        for phoneme, duration_samples in chunk.phoneme_alignments:
            duration_ms = (duration_samples / chunk.sample_rate) * 1000
            print(f"Phoneme: {phoneme}, Duration: {duration_ms:.2f} ms")
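For viseme synchronization you typically need absolute timestamps rather than per-phoneme durations. A minimal sketch of that conversion, using a hypothetical alignment_timeline helper and made-up alignment data (in practice the pairs come from chunk.phoneme_alignments and the rate from chunk.sample_rate):

```python
# Convert (phoneme, sample_count) pairs into absolute start/end times.
# The alignment values below are invented for illustration; real values
# come from chunk.phoneme_alignments of a patched voice.

SAMPLE_RATE = 22050  # a common Piper voice sample rate


def alignment_timeline(alignments, sample_rate):
    """Return (phoneme, start_ms, end_ms) tuples from (phoneme, samples) pairs."""
    timeline = []
    position = 0  # running offset in samples
    for phoneme, samples in alignments:
        start_ms = position * 1000 / sample_rate
        end_ms = (position + samples) * 1000 / sample_rate
        timeline.append((phoneme, start_ms, end_ms))
        position += samples
    return timeline


example = [("h", 2205), ("ə", 1102), ("l", 3307)]
for phoneme, start, end in alignment_timeline(example, SAMPLE_RATE):
    print(f"{phoneme}: {start:.1f} ms to {end:.1f} ms")
```

Keeping a running sample offset (rather than summing milliseconds) avoids accumulating rounding error over long utterances.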

C/C++ API for Alignments

The piper_audio_chunk struct in the C API is extended with new fields to expose alignment data:

  • const char32_t *phonemes: An array of phoneme codepoints.
  • size_t num_phonemes: The number of phoneme codepoints.
  • const int *phoneme_ids: An array of phoneme IDs.
  • size_t num_phoneme_ids: The number of phoneme IDs.
  • const int *alignments: An array of sample counts for each phoneme ID.
  • size_t num_alignments: The number of alignments.

The alignments array will be empty (num_alignments == 0) if the voice does not support alignments. The phonemes and phoneme_ids fields are always populated.

Interpreting Alignment Data in C/C++

The relationship between phonemes, IDs, and alignments is structured to handle cases where a single phoneme maps to multiple IDs.

  • phoneme_ids: This array contains the sequence of IDs sent to the model, like [1, 0, id1, 0, id2, 0, ..., 2], where 0 is padding, 1 is start-of-sentence, and 2 is end-of-sentence.
  • phonemes: This array aligns with phoneme_ids. It looks like [p1, p1, 0, p2, p2, 0, ...] where a phoneme codepoint is repeated for each of its corresponding IDs. Groups are separated by a 0.

To correlate a phoneme with its duration:

  1. Read a group of N identical codepoints from the phonemes array until a 0 is encountered.
  2. The next N values in phoneme_ids correspond to that phoneme.
  3. The next N values in alignments are the sample counts for those IDs.
  4. Advance your position in the phoneme_ids and alignments arrays by N and repeat.
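The four steps above can be sketched in Python, using plain lists in place of the C arrays. The sample data is invented for illustration; in a real program the arrays come from piper_audio_chunk:

```python
# A sketch of the correlation procedure, using plain Python lists in
# place of the C arrays. The data at the bottom is invented for
# illustration only.

def correlate(phonemes, phoneme_ids, alignments):
    """Yield (phoneme, ids, sample_counts) for each phoneme group.

    phonemes    -- codepoints, one per ID, with groups separated by 0
    phoneme_ids -- IDs sent to the model
    alignments  -- sample count for each phoneme ID
    """
    pos = 0  # shared read position in phoneme_ids and alignments
    i = 0    # read position in phonemes
    while i < len(phonemes):
        # Step 1: read a group of N identical codepoints up to a 0.
        codepoint = phonemes[i]
        n = 0
        while i < len(phonemes) and phonemes[i] == codepoint:
            n += 1
            i += 1
        if i < len(phonemes) and phonemes[i] == 0:
            i += 1  # skip the group separator
        # Steps 2-3: the next N IDs and N sample counts belong to
        # this phoneme.
        yield chr(codepoint), phoneme_ids[pos:pos + n], alignments[pos:pos + n]
        # Step 4: advance the shared position by N.
        pos += n


# "h" maps to two IDs (with interleaved padding), "e" to one.
phonemes = [ord("h"), ord("h"), 0, ord("e"), 0]
phoneme_ids = [20, 0, 14]   # hypothetical IDs
alignments = [512, 0, 768]  # samples per ID

for phoneme, ids, samples in correlate(phonemes, phoneme_ids, alignments):
    print(f"{phoneme}: ids={ids}, samples={sum(samples)}")
```

Summing the sample counts within each group gives the total duration of that phoneme, which can then be converted to milliseconds with the voice's sample rate.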