This paper presents our work on phrase break prediction in the context of end-to-end TTS systems, motivated by the following questions: (i) Is there any utility in incorporating an explicit phrasing model in an end-to-end TTS system? and (ii) How do you evaluate the effectiveness of a phrasing model in an end-to-end TTS system? In particular, the utility and effectiveness of phrase break prediction models are evaluated in the context of children's story synthesis, using listener comprehension. We show by means of perceptual listening evaluations that there is a clear preference for stories synthesized after predicting the location of phrase breaks using a trained phrasing model, over stories directly synthesized without predicting the location of phrase breaks.