In this study, we propose SimpleSpeech, a simple and efficient non-autoregressive (NAR) text-to-speech (TTS) system based on diffusion. Its simplicity is reflected in three aspects: (1) it can be trained on a speech-only dataset, without any alignment information; (2) it takes plain text directly as input and generates speech in an NAR manner; (3) it models speech in a finite and compact latent space, which reduces the modeling difficulty for diffusion. More specifically, we propose a novel speech codec model with scalar quantization (SQ-Codec), which maps the complex speech signal into a finite and compact latent space that we call the scalar latent space. Building on SQ-Codec, we apply a novel transformer diffusion model in this scalar latent space. We train SimpleSpeech on 4k hours of speech-only data, and it shows natural prosody and voice cloning ability. Compared with previous large-scale TTS models, it achieves significant improvements in both speech quality and generation speed. Demos are released.
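To make the idea of a "finite and compact" scalar latent space concrete, here is a minimal sketch of scalar quantization. The abstract does not state how SQ-Codec quantizes its latents, so everything below is an assumption for illustration: the function name `scalar_quantize`, the `scale` parameter, and the choice of a `tanh` bound followed by rounding to a uniform grid are hypothetical and may differ from the actual model.

```python
import math


def scalar_quantize(latent, scale=9):
    """Quantize each latent value independently (illustrative assumption).

    Each value is squashed into (-1, 1) with tanh, then rounded to the
    nearest point on a uniform grid with 2 * scale + 1 levels in [-1, 1].
    """
    return [round(math.tanh(x) * scale) / scale for x in latent]


# Hypothetical continuous latent vector for one frame.
latent = [-3.0, -0.2, 0.0, 0.4, 2.5]
quantized = scalar_quantize(latent)
print(quantized)
```

Under these assumptions every dimension can take only `2 * scale + 1` distinct values, so the latent space is bounded and finite while no learned codebook is needed. A diffusion model operating in such a space would have a restricted, well-defined target range, which is one plausible reading of why the abstract says this eases modeling.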