Prosody carries rich information beyond the literal meaning of words and is crucial to the intelligibility of speech. Current models still fall short in phrasing and intonation: they miss or misplace breaks when synthesizing long sentences with complex structures, and they produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech (TTS) model with a flow-matching (FM) backbone that aims to improve the phrasing and intonation aspects of prosody. ProsodyFM introduces two key components: a Phrase Break Encoder that captures initial phrase break locations, followed by a Duration Predictor that flexibly adjusts break durations; and a Terminal Intonation Encoder that learns a bank of intonation shape tokens, combined with a novel Pitch Processor, for more robust modeling of human-perceived intonation change. ProsodyFM is trained without explicit prosodic labels, yet it uncovers a broad spectrum of break durations and intonation patterns. Experimental results show that ProsodyFM improves the phrasing and intonation aspects of prosody, and thereby overall intelligibility, compared with four state-of-the-art (SOTA) models. Out-of-distribution experiments show that this prosody improvement also gives ProsodyFM better generalization to unseen complex sentences and unseen speakers. A case study illustrates ProsodyFM's fine-grained control over phrasing and intonation.