Text-to-speech (TTS) systems that scale up the amount of training data have achieved significant improvements in zero-shot speech synthesis. However, these systems have notable limitations: they require large amounts of training data, which increases costs, and they often overlook prosody similarity. To address these issues, we propose MultiVerse, a zero-shot multi-task TTS system that can perform TTS or speech style transfer in zero-shot and cross-lingual conditions. MultiVerse requires far less training data than traditional data-driven approaches. To ensure zero-shot performance even with limited data, we leverage disentanglement based on source-filter theory, using the prompt to model filter-related and source-related representations. Additionally, to further enhance prosody similarity, we adopt a prosody modeling approach that combines prompt-based autoregressive and non-autoregressive methods. Evaluations demonstrate the remarkable zero-shot multi-task TTS performance of MultiVerse: it not only achieves zero-shot TTS performance comparable to data-driven TTS systems with far less data, but also significantly outperforms other zero-shot TTS systems trained on the same small amount of data. In particular, our novel prosody modeling technique contributes substantially to MultiVerse's ability to generate speech with high prosody similarity to the given prompts. Our samples are available at https://nc-ai.github.io/speech/publications/multiverse/index.html