In expressive speech synthesis, latent prosody representations are widely used to handle the variability of the training data. The same text may correspond to many acoustic realizations, which is known as the one-to-many mapping problem in text-to-speech. Utterance-, word-, or phoneme-level representations are extracted from the target signal in an auto-encoding setup to complement the phonetic input and simplify that mapping. This paper compares prosodic embeddings at different levels of granularity and examines how well they can be predicted from text. We show that utterance-level embeddings have insufficient capacity, while phoneme-level embeddings tend to introduce instabilities when predicted from text. Word-level representations strike a balance between capacity and predictability. As a result, we close 90% of the gap in naturalness between synthetic speech and recordings on the LibriTTS dataset, without sacrificing intelligibility.