This paper presents our work on phrase break prediction in the context of end-to-end TTS systems, motivated by the following questions: (i) Is there any utility in incorporating an explicit phrasing model in an end-to-end TTS system? and (ii) How do you evaluate the effectiveness of a phrasing model in an end-to-end TTS system? In particular, the utility and effectiveness of phrase break prediction models are evaluated in the context of children's story synthesis, using listener comprehension. We show by means of perceptual listening evaluations that there is a clear preference for stories synthesized after predicting the location of phrase breaks using a trained phrasing model, over stories directly synthesized without predicting the location of phrase breaks.