Audio Alignments
Piper includes experimental support for audio alignments, which expose the duration (in audio samples) of each phoneme ID used during synthesis. This feature is useful for applications like synchronizing synthesized speech with mouth movements (visemes) in animations.
Patching Voices for Alignments
To enable alignment data, you must first "patch" the voice's ONNX model file. This requires the `onnx` Python package.
- Install the `onnx` package: `pip install onnx`
- Run the patching script:

  ```sh
  python3 -m piper.patch_voice_with_alignment /path/to/your/model.onnx
  ```
This script modifies the ONNX model to output alignment information. Once patched, the onnx package is no longer required for synthesis. Patched models should remain compatible with older Piper installations, which will simply ignore the extra output.
Python API for Alignments
The AudioChunk class in the Python API is extended with several fields when using a patched model:
- `phonemes` (list of str): The phonemes used to produce the audio chunk.
- `phoneme_ids` (list of int): The phoneme IDs used by the model.
- `phoneme_id_samples` (list of int): The number of audio samples corresponding to each phoneme ID.
- `phoneme_alignments` (list of tuples): A list of `(phoneme, sample_count)` tuples.
These fields will be `None` if the voice model does not support alignments or if they are disabled in the `SynthesisConfig` (`include_alignments=False`).
Example usage:
```python
from piper import PiperVoice

# Load a patched voice
voice = PiperVoice.load("./path/to/patched_model.onnx")

for chunk in voice.synthesize("Hello world."):
    if chunk.phoneme_alignments:
        for phoneme, duration_samples in chunk.phoneme_alignments:
            duration_ms = (duration_samples / chunk.sample_rate) * 1000
            print(f"Phoneme: {phoneme}, Duration: {duration_ms:.2f} ms")
```
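For synchronization use cases such as viseme animation, the per-phoneme sample counts can be accumulated into absolute start/end timestamps. A minimal sketch in plain Python (no Piper dependency; the alignment pairs and 22,050 Hz sample rate below are made-up illustration values):

```python
def alignment_timeline(phoneme_alignments, sample_rate):
    """Turn (phoneme, sample_count) pairs into (phoneme, start_s, end_s)
    timestamps by accumulating durations."""
    timeline = []
    t = 0.0
    for phoneme, n_samples in phoneme_alignments:
        dur = n_samples / sample_rate
        timeline.append((phoneme, t, t + dur))
        t += dur
    return timeline

# Made-up alignment data at a hypothetical 22,050 Hz sample rate.
print(alignment_timeline([("h", 2205), ("e", 4410)], 22050))
```

In a real application, `phoneme_alignments` and `sample_rate` would come from each `AudioChunk`, with the running offset carried across chunks.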
C/C++ API for Alignments
The `piper_audio_chunk` struct in the C API is extended with new fields to expose alignment data:

- `const char32_t *phonemes`: An array of phoneme codepoints.
- `size_t num_phonemes`: The number of phoneme codepoints.
- `const int *phoneme_ids`: An array of phoneme IDs.
- `size_t num_phoneme_ids`: The number of phoneme IDs.
- `const int *alignments`: An array of sample counts for each phoneme ID.
- `size_t num_alignments`: The number of alignments.
The alignments field will be empty if the voice does not support them. The phonemes and phoneme_ids fields are always present.
Interpreting Alignment Data in C/C++
The relationship between phonemes, IDs, and alignments is structured to handle cases where a single phoneme maps to multiple IDs.
- `phoneme_ids`: This array contains the sequence of IDs sent to the model, like `[1, 0, id1, 0, id2, 0, ..., 2]`, where `0` is padding, `1` is start-of-sentence, and `2` is end-of-sentence.
- `phonemes`: This array aligns with `phoneme_ids`. It looks like `[p1, p1, 0, p2, p2, 0, ...]`, where a phoneme codepoint is repeated for each of its corresponding IDs. Groups are separated by a `0`.
To correlate a phoneme with its duration:
- Read a group of `N` identical codepoints from the `phonemes` array until a `0` is encountered.
- The next `N` values in `phoneme_ids` correspond to that phoneme.
- The next `N` values in `alignments` are the sample counts for those IDs.
- Advance your position in the `phoneme_ids` and `alignments` arrays by `N` and repeat.
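The walk described above can be sketched in Python over made-up arrays. For simplicity, this sketch assumes the start/end-of-sentence markers and interleaved padding have already been stripped from `phoneme_ids` and `alignments`; the codepoints, IDs, and sample counts are invented for illustration:

```python
def correlate(phonemes, phoneme_ids, alignments):
    """Yield (phoneme, total_samples) per phoneme group, walking the
    three parallel arrays as described in the steps above."""
    results = []
    p = 0  # position in phonemes
    q = 0  # position in phoneme_ids / alignments
    while p < len(phonemes):
        # Read a run of N identical codepoints, ending at a 0 separator.
        cp = phonemes[p]
        n = 0
        while p < len(phonemes) and phonemes[p] == cp:
            n += 1
            p += 1
        if p < len(phonemes) and phonemes[p] == 0:
            p += 1  # skip the group separator
        # The next N entries in phoneme_ids/alignments belong to cp.
        # (The IDs themselves are not needed to total the duration,
        # but they are consumed in parallel at the same offset q.)
        total = sum(alignments[q:q + n])
        results.append((chr(cp), total))
        q += n
    return results

# Made-up example: "h" expands to two IDs, "i" to one.
phonemes = [ord("h"), ord("h"), 0, ord("i"), 0]
phoneme_ids = [27, 41, 15]      # hypothetical IDs (padding stripped)
alignments = [256, 128, 512]    # hypothetical sample counts
print(correlate(phonemes, phoneme_ids, alignments))
# [('h', 384), ('i', 512)]
```

The same loop translates directly to C by iterating over the struct's `phonemes`, `phoneme_ids`, and `alignments` pointers with their respective lengths.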