While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Because speech intricately encompasses several attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate each of them individually. Motivated by this, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model that generates the attributes in each subspace conditioned on its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech through disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms state-of-the-art TTS systems in quality, similarity, prosody, and intelligibility, and achieves quality on par with human recordings. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.
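The factorized vector quantization idea can be illustrated with a minimal sketch: the codec latent is split into per-attribute subspaces, and each subspace is quantized against its own codebook by nearest-neighbor lookup. The subspace dimensions, codebook sizes, and four-way split below are illustrative assumptions, not the paper's actual configuration.

```python
# Hedged sketch of factorized vector quantization (FVQ).
# All sizes here (SUB_DIM, CODEBOOK_SIZE) are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

ATTRIBUTES = ["content", "prosody", "timbre", "acoustic_details"]
SUB_DIM = 8          # dimension of each attribute subspace (assumed)
CODEBOOK_SIZE = 16   # number of entries per codebook (assumed)

# One independent codebook per attribute subspace, so attributes
# are quantized (and later generated) separately.
codebooks = {a: rng.normal(size=(CODEBOOK_SIZE, SUB_DIM)) for a in ATTRIBUTES}

def factorized_quantize(z):
    """Quantize a latent z of length 4 * SUB_DIM, one subspace at a time.

    Returns a dict of discrete tokens (one per attribute) and the
    reconstructed latent assembled from the chosen codebook entries.
    """
    tokens, parts = {}, []
    for i, attr in enumerate(ATTRIBUTES):
        sub = z[i * SUB_DIM:(i + 1) * SUB_DIM]
        # Nearest codebook entry by Euclidean distance.
        idx = int(np.argmin(np.linalg.norm(codebooks[attr] - sub, axis=1)))
        tokens[attr] = idx
        parts.append(codebooks[attr][idx])
    return tokens, np.concatenate(parts)

z = rng.normal(size=4 * SUB_DIM)  # a stand-in for one frame's codec latent
tokens, z_q = factorized_quantize(z)
```

In the full system, each token stream would be produced by a diffusion model conditioned on its attribute-specific prompt; this sketch only shows the disentangled quantization step.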