Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias

Scaling text-to-speech (TTS) to large, in-the-wild datasets has proven highly effective for generalizing timbre and speech style, particularly in zero-shot TTS. However, previous works usually encode speech into latents with an audio codec and generate them with autoregressive language models or diffusion models, which ignores the intrinsic nature of speech and may lead to inferior or uncontrollable results. We argue that speech can be decomposed into several attributes (e.g., content, timbre, prosody, and phase) and that each of them should be modeled by a module with appropriate inductive biases. From this perspective, we carefully design a novel, large zero-shot TTS system called Mega-TTS, which is trained on large-scale wild data and models different attributes in different ways: 1) Instead of using latents encoded by an audio codec as the intermediate feature, we still choose the spectrogram, as it separates phase from the other attributes very well. Phase can be appropriately constructed by a GAN-based vocoder and does not need to be modeled by the language model. 2) We model timbre using global vectors, since timbre is a global attribute that changes slowly over time. 3) We further use a VQGAN-based acoustic model to generate the spectrogram and a latent code language model to fit the distribution of prosody, since prosody changes quickly over time within a sentence and language models can capture both local and long-range dependencies. We scale Mega-TTS to multi-domain datasets with 20K hours of speech and evaluate its performance on unseen speakers. Experimental results demonstrate that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks, with superior naturalness, robustness, and speaker similarity due to the proper inductive bias of each module. Audio samples are available at https://mega-tts.github.io/demo-page.
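
As a rough illustration of the three modeling choices listed in the abstract, the toy sketch below uses only the Python standard library. It is not the Mega-TTS implementation. The function names (`magnitude_and_phase`, `global_vector`, `quantize`) are hypothetical. Mean pooling stands in, by assumption, for whatever timbre encoder the system actually uses, and nearest-neighbour lookup shows only the generic vector-quantization step, not a VQGAN. The sketch shows that a magnitude spectrum stays the same when only phase changes, that a time-varying sequence can be collapsed into one global vector, and that frames can be mapped to discrete codes that a language model could then predict.

```python
import cmath
import math


def magnitude_and_phase(frame):
    """Naive DFT of one frame, returned as separate magnitude and phase lists."""
    n = len(frame)
    bins = [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]
    return [abs(b) for b in bins], [cmath.phase(b) for b in bins]


def global_vector(frames):
    """Mean-pool frame features over time into one utterance-level vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def quantize(frames, codebook):
    """Map each frame to the index of its nearest codebook entry (squared L2)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda i: dist(frame, codebook[i]))
        for frame in frames
    ]


# Toy signal: a sine with 2 cycles in an 8-sample frame, plus a shifted copy.
frame = [math.sin(2 * math.pi * 2 * t / 8) for t in range(8)]
shifted = frame[1:] + frame[:1]  # circular shift: changes phase, not magnitude

mag_a, phase_a = magnitude_and_phase(frame)
mag_b, phase_b = magnitude_and_phase(shifted)

# One global vector summarizing both frames (assumed stand-in for timbre).
timbre = global_vector([mag_a, mag_b])

# Discrete codes for three toy 2-D "prosody" frames and a 2-entry codebook.
codes = quantize(
    [[0.1, 0.0], [0.9, 1.1], [0.2, -0.1]],
    codebook=[[0.0, 0.0], [1.0, 1.0]],
)
```

Here `mag_a` and `mag_b` agree up to floating-point error, while the phase at the sine's frequency bin differs between the two frames. This is the sense in which a magnitude spectrogram leaves phase to be reconstructed separately, for example by a vocoder.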

Related content

Speech synthesis

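Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on many disciplines, including artificial intelligence, psychology, acoustics, linguistics, digital signal processing, and computer science, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric synthesis, and then to hybrid synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, in car navigation systems, and in automated call centers, and it has produced substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, its applications are extending into entertainment, voice-based teaching, and rehabilitation therapy. Speech synthesis is thus influencing many aspects of people's lives.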