# Audio Alignments
Piper includes experimental support for audio alignments, which expose the duration (in audio samples) of each phoneme ID used during synthesis. This feature is useful for applications like synchronizing synthesized speech with mouth movements (visemes) in animations.
## Patching Voices for Alignments

To enable alignment data, you must first "patch" the voice's ONNX model file. This requires the `onnx` Python package.

1. Install the `onnx` package:

   ```sh
   pip install onnx
   ```

2. Run the patching script:

   ```sh
   python3 -m piper.patch_voice_with_alignment /path/to/your/model.onnx
   ```

This script modifies the ONNX model to output alignment information. Once patched, the `onnx` package is no longer required for synthesis. Patched models should remain compatible with older Piper installations, which will simply ignore the extra output.
## Python API for Alignments

The `AudioChunk` class in the Python API is extended with several fields when using a patched model:

- `phonemes` (list of str): The phonemes used to produce the audio chunk.
- `phoneme_ids` (list of int): The phoneme IDs used by the model.
- `phoneme_id_samples` (list of int): The number of audio samples corresponding to each phoneme ID.
- `phoneme_alignments` (list of tuples): A list of `(phoneme, sample_count)` tuples.

These fields will be `None` if the voice model does not support alignments or if they are disabled in the `SynthesisConfig` (`include_alignments=False`).
Example usage:

```python
from piper import PiperVoice

# Load a patched voice
voice = PiperVoice.load("./path/to/patched_model.onnx")

for chunk in voice.synthesize("Hello world."):
    if chunk.phoneme_alignments:
        for phoneme, duration_samples in chunk.phoneme_alignments:
            duration_ms = (duration_samples / chunk.sample_rate) * 1000
            print(f"Phoneme: {phoneme}, Duration: {duration_ms:.2f} ms")
```
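For viseme synchronization you usually need each phoneme's start time as well as its duration, which you can get by accumulating sample counts. Below is a minimal sketch of that bookkeeping; the `phoneme_timeline` helper, the phoneme strings, and the sample counts are all invented for illustration — real values would come from an `AudioChunk`:

```python
# Mock alignment data; in practice these come from an AudioChunk
# produced by a patched voice.
sample_rate = 22050
phonemes = ["h", "ə", "l", "oʊ"]
phoneme_id_samples = [1024, 2048, 1536, 3072]

def phoneme_timeline(phonemes, sample_counts, sample_rate):
    """Return (phoneme, start_seconds, duration_seconds) tuples by
    accumulating per-phoneme sample counts into a running offset."""
    timeline = []
    offset = 0
    for phoneme, count in zip(phonemes, sample_counts):
        timeline.append((phoneme, offset / sample_rate, count / sample_rate))
        offset += count
    return timeline

for phoneme, start, duration in phoneme_timeline(
    phonemes, phoneme_id_samples, sample_rate
):
    print(f"{phoneme}: start={start:.3f}s duration={duration:.3f}s")
```

Note that this sketch assumes one sample count per phoneme; when a phoneme maps to multiple phoneme IDs, you would first sum its IDs' counts (as `phoneme_alignments` already does).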
## C/C++ API for Alignments

The `piper_audio_chunk` struct in the C API is extended with new fields to expose alignment data:

- `const char32_t *phonemes`: An array of phoneme codepoints.
- `size_t num_phonemes`: The number of phoneme codepoints.
- `const int *phoneme_ids`: An array of phoneme IDs.
- `size_t num_phoneme_ids`: The number of phoneme IDs.
- `const int *alignments`: An array of sample counts for each phoneme ID.
- `size_t num_alignments`: The number of alignments.

The `alignments` field will be empty if the voice does not support them. The `phonemes` and `phoneme_ids` fields are always present.
### Interpreting Alignment Data in C/C++

The relationship between phonemes, IDs, and alignments is structured to handle cases where a single phoneme maps to multiple IDs.

- `phoneme_ids`: This array contains the sequence of IDs sent to the model, like `[1, 0, id1, 0, id2, 0, ..., 2]`, where `0` is padding, `1` is start-of-sentence, and `2` is end-of-sentence.
- `phonemes`: This array aligns with `phoneme_ids`. It looks like `[p1, p1, 0, p2, p2, 0, ...]`, where a phoneme codepoint is repeated once for each of its corresponding IDs. Groups are separated by a `0`.
To correlate a phoneme with its duration:

1. Read a group of `N` identical codepoints from the `phonemes` array until a `0` is encountered.
2. The next `N` values in `phoneme_ids` correspond to that phoneme.
3. The next `N` values in `alignments` are the sample counts for those IDs.
4. Advance your position in the `phoneme_ids` and `alignments` arrays by `N` and repeat.
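The walk above can be sketched as follows. This is Python with mock arrays purely to illustrate the indexing; a C implementation would step through the struct's `phonemes` and `alignments` pointers the same way. The `total_samples_per_phoneme` helper and its input values are invented for this example:

```python
def total_samples_per_phoneme(phonemes, alignments):
    """Group repeated codepoints in `phonemes` (groups end at a 0
    separator) and sum the corresponding sample counts from
    `alignments`. Returns a list of (codepoint, total_samples)."""
    result = []
    pos = 0  # shared position into phoneme_ids / alignments
    i = 0    # position into phonemes
    while i < len(phonemes):
        if phonemes[i] == 0:  # skip the group separator
            i += 1
            continue
        codepoint = phonemes[i]
        n = 0  # count the N identical codepoints in this group
        while i < len(phonemes) and phonemes[i] == codepoint:
            n += 1
            i += 1
        # The next N values in alignments belong to this phoneme.
        result.append((codepoint, sum(alignments[pos:pos + n])))
        pos += n
    return result

# "a" maps to two IDs (150 samples total), "b" to one ID (200 samples).
mock_phonemes = [ord("a"), ord("a"), 0, ord("b"), 0]
mock_alignments = [100, 50, 200]
for codepoint, samples in total_samples_per_phoneme(mock_phonemes, mock_alignments):
    print(f"{chr(codepoint)}: {samples} samples")
```

Divide each total by the chunk's sample rate to convert it to seconds, as in the Python example earlier.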