Prosody carries rich information beyond the literal meaning of words and is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation: they miss or misplace breaks when synthesizing long sentences with complex structures, and they produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech (TTS) synthesis model with a flow-matching (FM) backbone that aims to improve the phrasing and intonation aspects of prosody. ProsodyFM introduces two key components: (1) a Phrase Break Encoder that captures initial phrase break locations, followed by a Duration Predictor that flexibly adjusts break durations; and (2) a Terminal Intonation Encoder that combines a set of intonation shape tokens with a novel Pitch Processor to model human-perceived intonation change more robustly. ProsodyFM is trained without explicit prosodic labels, yet it uncovers a broad spectrum of break durations and intonation patterns. Experimental results show that ProsodyFM improves phrasing and intonation, and thereby overall intelligibility, compared with four state-of-the-art (SOTA) models. Out-of-distribution experiments show that this prosody improvement also gives ProsodyFM better generalization to unseen complex sentences and speakers. Our case study illustrates ProsodyFM's fine-grained control over phrasing and intonation.
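For readers unfamiliar with flow matching, the following is a minimal sketch of how one training example is commonly built under the optimal-transport conditional flow-matching formulation. The abstract does not state which FM variant ProsodyFM uses, so this formulation is an assumption. The names `cfm_training_pair`, `cfm_loss`, and `sigma_min` are hypothetical, and the flat list `x1` stands in for acoustic features such as mel-spectrogram values.

```python
import random


def cfm_training_pair(x1, rng, sigma_min=1e-4):
    """Build one flow-matching training example from a data sample x1.

    Assumption: the optimal-transport conditional path, where a noise
    sample x0 is moved along a straight line toward the data sample x1.
    Returns the time t, the interpolated point x_t, and the target velocity.
    """
    # Noise sample with the same length as the data sample.
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    # Random time in [0, 1).
    t = rng.random()
    # Point on the straight path between noise and data at time t.
    x_t = [(1.0 - (1.0 - sigma_min) * t) * n + t * d for n, d in zip(x0, x1)]
    # Velocity the network should predict at (x_t, t).
    target = [d - (1.0 - sigma_min) * n for n, d in zip(x0, x1)]
    return t, x_t, target


def cfm_loss(predicted, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


if __name__ == "__main__":
    rng = random.Random(0)
    features = [0.5, -1.0, 2.0]  # hypothetical stand-in for acoustic features
    t, x_t, target = cfm_training_pair(features, rng)
    # A real model would predict the velocity from (x_t, t) plus text and
    # prosody conditioning; an all-zero guess is used here only to call the loss.
    print(cfm_loss([0.0] * len(target), target))
```

In a prosody-aware model of the kind described above, the velocity predictor would additionally be conditioned on the text and on the outputs of the break and intonation encoders. That conditioning is omitted here because the abstract does not specify it.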