In text-to-speech (TTS) tasks, latent diffusion models offer excellent fidelity and generalization, but their high resource consumption and slow inference speed have long been a challenge. This paper proposes DCTTS, a Discrete Diffusion Model with Contrastive Learning for Text-to-Speech Generation. DCTTS makes the following contributions: 1) a TTS diffusion model that operates in a discrete space, which significantly lowers the computational cost of the diffusion model and improves sampling speed; 2) a contrastive learning method, also based on the discrete space, which strengthens the alignment between speech and text and improves sampling quality; and 3) an efficient text encoder, which reduces the model's parameters and increases computational efficiency. Experimental results demonstrate that the proposed approach achieves outstanding speech synthesis quality and sampling speed while significantly reducing the resource consumption of the diffusion model. Synthesized samples are available at https://github.com/lawtherWu/DCTTS.
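To illustrate the idea behind the second contribution, the following is a minimal sketch of a contrastive objective that pulls matched text and speech representations together and pushes mismatched pairs apart. The abstract does not specify the loss used in DCTTS, so this uses a generic symmetric InfoNCE loss as a stand-in; the function names, the embedding shapes, and the `temperature` value are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def symmetric_info_nce(text_emb, speech_emb, temperature=0.1):
    """Generic symmetric InfoNCE loss (an assumed stand-in, not the DCTTS loss).

    text_emb, speech_emb: arrays of shape (batch, dim); row i of each
    array is assumed to come from the same utterance.
    """
    # L2-normalize so the dot product is a cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)

    # Pairwise similarities; matched pairs lie on the diagonal.
    logits = t @ s.T / temperature
    idx = np.arange(len(t))

    # Text-to-speech direction (rows) and speech-to-text direction (columns).
    loss_t2s = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_s2t = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_t2s + loss_s2t)


# Toy data: random vectors standing in for text and speech embeddings.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))

# Well-aligned pairs: speech embeddings are near-copies of the text ones.
aligned = symmetric_info_nce(text, text + 0.01 * rng.normal(size=(4, 8)))

# Misaligned pairs: every text row is paired with a different utterance.
shuffled = symmetric_info_nce(text, text[::-1].copy())
```

In this toy setup the loss for the aligned pairs is lower than for the shuffled pairs, which is the behavior a contrastive alignment term relies on. How such a term is computed over discrete speech tokens and combined with the diffusion objective in DCTTS is not stated in the abstract.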