ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis

Prosody contains rich information beyond the literal meaning of words, which is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation; they not only miss or misplace breaks when synthesizing long sentences with complex structures but also produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis (TTS) model with a flow-matching (FM) backbone that aims to enhance the phrasing and intonation aspects of prosody. ProsodyFM introduces two key components: a Phrase Break Encoder to capture initial phrase break locations, followed by a Duration Predictor for the flexible adjustment of break durations; and a Terminal Intonation Encoder which integrates a set of intonation shape tokens combined with a novel Pitch Processor for more robust modeling of human-perceived intonation change. ProsodyFM is trained with no explicit prosodic labels and yet can uncover a broad spectrum of break durations and intonation patterns. Experimental results demonstrate that ProsodyFM can effectively improve the phrasing and intonation aspects of prosody, thereby enhancing the overall intelligibility compared to four state-of-the-art (SOTA) models. Out-of-distribution experiments show that this prosody improvement can further bring ProsodyFM superior generalizability for unseen complex sentences and speakers. Our case study intuitively illustrates the powerful and fine-grained controllability of ProsodyFM over phrasing and intonation.

翻译：韵律蕴含超越文字字面意义的丰富信息，这对语音的可理解性至关重要。现有模型在短语切分与语调方面仍存在不足：它们不仅在合成具有复杂结构的长句时会遗漏或错误放置停顿，还会产生不自然的语调。我们提出ProsodyFM，一种基于流匹配（FM）框架的韵律感知文本到语音合成（TTS）模型，旨在提升韵律的短语切分与语调表现。ProsodyFM引入了两个关键组件：一个用于捕捉初始短语停顿位置的短语停顿编码器，其后接一个用于灵活调整停顿时长的时长预测器；以及一个终端语调编码器，它集成了一组语调形态标记并结合新颖的音高处理器，以对人类感知的语调变化进行更鲁棒的建模。ProsodyFM在训练时无需显式的韵律标注，却能揭示广泛的停顿时长与语调模式。实验结果表明，与四种先进（SOTA）模型相比，ProsodyFM能有效改善韵律的短语切分与语调表现，从而提升整体可理解性。分布外实验表明，这种韵律改进能进一步赋予ProsodyFM对未见复杂句子和说话人更优的泛化能力。我们的案例研究直观地展示了ProsodyFM在短语切分与语调方面强大且细粒度的可控性。

相关内容

语音合成

关注 491

语音合成（Speech Synthesis），也称为文语转换（Text-to-Speech, TTS,它是将任意的输入文本转换成自然流畅的语音输出。语音合成涉及到人工智能、心理学、声学、语言学、数字信号处理、计算机科学等多个学科技术，是信息处理领域中的一项前沿技术。随着计算机技术的不断提高，语音合成技术从早期的共振峰合成,逐步发展为波形拼接合成和统计参数语音合成，再发展到混合语音合成；合成语音的质量、自然度已经得到明显提高，基本能满足一些特定场合的应用需求。目前，语音合成技术在银行、医院等的信息播报系统、汽车导航系统、自动应答呼叫中心等都有广泛应用，取得了巨大的经济效益。另外，随着智能手机、MP3、PDA 等与我们生活密切相关的媒介的大量涌现，语音合成的应用也在逐渐向娱乐、语音教学、康复治疗等领域深入。可以说语音合成正在影响着人们生活的方方面面。

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

生成性对抗网络:理论模型、评估指标和最近发展的概述，Generative Adversarial Networks (GANs): An Overview of Theoretical Model, Evaluation Metrics, and Recent Developments

专知会员服务

42+阅读 · 2020年5月30日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日