Scaling text-to-speech (TTS) to large, in-the-wild datasets has proven highly effective for generalizing timbre and speech style, particularly in zero-shot TTS. However, previous works usually encode speech into latents with an audio codec and generate them with autoregressive language models or diffusion models, which ignores the intrinsic nature of speech and may lead to inferior or uncontrollable results. We argue that speech can be decomposed into several attributes (e.g., content, timbre, prosody, and phase) and that each of them should be modeled by a module with appropriate inductive biases. From this perspective, we design Mega-TTS, a novel large-scale zero-shot TTS system that is trained on large-scale in-the-wild data and models different attributes in different ways: 1) Instead of using latents encoded by an audio codec as the intermediate feature, we keep the spectrogram, since it separates phase from the other attributes very well. Phase can be appropriately constructed by a GAN-based vocoder and does not need to be modeled by the language model. 2) We model timbre with global vectors, since timbre is a global attribute that changes slowly over time. 3) We use a VQGAN-based acoustic model to generate the spectrogram and a latent code language model to fit the distribution of prosody, since prosody changes quickly over time within a sentence and language models can capture both local and long-range dependencies. We scale Mega-TTS to multi-domain datasets with 20K hours of speech and evaluate its performance on unseen speakers. Experimental results demonstrate that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks, with superior naturalness, robustness, and speaker similarity due to the proper inductive bias of each module. Audio samples are available at https://mega-tts.github.io/demo-page.
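Two of the design choices above can be illustrated with a toy sketch. This is not the Mega-TTS implementation: the helper names `dft`, `magnitude_and_phase`, and `mean_pool` are hypothetical, and mean pooling is only an assumed stand-in for however the system actually derives its global timbre vectors, which the abstract does not specify. The sketch shows (a) that a magnitude spectrum is unchanged by a circular time shift while the phase is not, which is the sense in which a spectrogram keeps phase apart from the other attributes, and (b) how a sequence of frame-level vectors can be collapsed into a single global vector.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one real-valued frame."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def magnitude_and_phase(frame):
    """Split a frame's spectrum into magnitude and phase components."""
    spectrum = dft(frame)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def mean_pool(frames):
    """Average frame-level vectors over time into one global vector.

    Assumption for illustration only; not the paper's timbre encoder.
    """
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


# Toy frame: a single sinusoid at frequency bin 2, 16 samples long.
N = 16
frame = [math.sin(2 * math.pi * 2 * t / N) for t in range(N)]
shifted = frame[3:] + frame[:3]  # circular time shift by 3 samples

mag_a, phase_a = magnitude_and_phase(frame)
mag_b, phase_b = magnitude_and_phase(shifted)

# The circular shift changes the phase at bin 2 but not the magnitudes.
same_magnitude = all(abs(a - b) < 1e-9 for a, b in zip(mag_a, mag_b))
phase_changed = abs(phase_a[2] - phase_b[2]) > 1e-3

# Frame-level features collapse to one vector for the whole utterance.
global_vector = mean_pool([mag_a, mag_b])
```

In this toy setting the magnitude carries the information a spectrogram-based acoustic model would see, while the discarded phase is what a vocoder would have to reconstruct.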