Pheme: Efficient and Conversational Speech Generation

In recent years, speech generation has seen remarkable progress, now achieving one-shot generation capability that is often virtually indistinguishable from a real human voice. Integrating such advancements in speech generation with large language models might revolutionize a wide range of applications. However, certain applications, such as assistive conversational systems, require natural and conversational speech generation tools that also operate efficiently in real time. Current state-of-the-art models like VALL-E and SoundStorm, powered by hierarchical neural audio codecs, require large neural components and extensive training data to work well. In contrast, MQTTS aims to build more compact conversational TTS models while capitalizing on smaller-scale real-life conversational speech data. However, its autoregressive nature yields high inference latency and thus limits its real-time usage. To mitigate the current limitations of the state-of-the-art TTS models while capitalizing on their strengths, in this work we introduce the Pheme model series that 1) offers compact yet high-performing models, 2) allows for parallel speech generation of 3) natural conversational speech, and 4) can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x while still matching the quality of the autoregressive TTS models. We also show that through simple teacher-student distillation we can achieve significant improvements in voice quality for single-speaker setups on top of pretrained Pheme checkpoints, relying solely on synthetic speech generated by much larger teacher models. Audio samples and pretrained models are available online.

