In recent years, speech generation has seen remarkable progress, now achieving one-shot generation capability that is often virtually indistinguishable from a real human voice. Integrating these advances with large language models could revolutionize a wide range of applications. However, certain applications, such as assistive conversational systems, require natural, conversational speech generation tools that also operate efficiently in real time. Current state-of-the-art models such as VALL-E and SoundStorm, powered by hierarchical neural audio codecs, require large neural components and extensive training data to work well. In contrast, MQTTS aims to build more compact conversational TTS models while capitalizing on smaller-scale real-life conversational speech data. However, its autoregressive nature yields high inference latency and thus limits its real-time usage. To mitigate the current limitations of state-of-the-art TTS models while capitalizing on their strengths, in this work we introduce the Pheme model series, which 1) offers compact yet high-performing models, 2) allows for parallel speech generation, 3) produces natural conversational speech, and 4) can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x while still matching the quality of autoregressive TTS models. We also show that simple teacher-student distillation, applied on top of pretrained Pheme checkpoints and relying solely on synthetic speech generated by much larger teacher models, yields significant improvements in voice quality for single-speaker setups. Audio samples and pretrained models are available online.