A phonetic posteriorgram (PPG) is a time-varying categorical distribution over acoustic units of speech (e.g., phonemes). PPGs are a popular representation in speech generation due to their ability to disentangle pronunciation features from speaker identity, allowing accurate reconstruction of pronunciation (e.g., voice conversion) and coarse-grained pronunciation editing (e.g., foreign accent conversion). In this paper, we demonstrably improve the quality of PPGs to produce a state-of-the-art interpretable PPG representation. We train an off-the-shelf speech synthesizer using our PPG representation and show that high-quality PPGs yield independent control over pitch and pronunciation. We further demonstrate novel uses of PPGs, such as an acoustic pronunciation distance and fine-grained pronunciation control.
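To make the definition concrete, the sketch below represents a PPG as a matrix with one row per frame and one column per acoustic unit, where each row is a categorical distribution that sums to one. It also computes one possible frame-wise distance between two time-aligned PPGs. The Jensen-Shannon divergence is chosen here purely for illustration and is an assumption, not necessarily the acoustic pronunciation distance proposed in the paper. The function names, the 40-unit inventory, and the random logits are likewise hypothetical placeholders.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Convert unnormalized scores to a categorical distribution along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def framewise_js_divergence(ppg_a, ppg_b, eps=1e-12):
    """Mean base-2 Jensen-Shannon divergence between two time-aligned PPGs.

    Both inputs have shape (frames, units) and each row sums to one.
    This is an illustrative distance (an assumption), bounded in [0, 1].
    """
    if ppg_a.shape != ppg_b.shape:
        raise ValueError("PPGs must be time-aligned and share the same unit set")
    # Clip to avoid log(0) for units with zero probability.
    p = np.clip(ppg_a, eps, 1.0)
    q = np.clip(ppg_b, eps, 1.0)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log2(p / m), axis=-1)
    kl_qm = np.sum(q * np.log2(q / m), axis=-1)
    return float(np.mean(0.5 * (kl_pm + kl_qm)))


# Hypothetical example: 100 frames over an assumed inventory of 40 units.
rng = np.random.default_rng(0)
ppg = softmax(rng.normal(size=(100, 40)))        # stand-in for a model's output
other = softmax(rng.normal(size=(100, 40)))      # a second, unrelated PPG

same = framewise_js_divergence(ppg, ppg)         # identical pronunciation
different = framewise_js_divergence(ppg, other)  # differing pronunciation
```

Under this sketch, identical PPGs give a distance of zero, and two PPGs that place all probability on different units in every frame approach the upper bound of one. Real recordings would first need time alignment, which this example assumes has already been done.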