Neural Text-to-Speech (TTS) systems are widely used in voice assistants, e-learning, and audiobook creation. Modern generative models such as Diffusion Models (DMs) hold promise for high-fidelity, real-time speech synthesis, but their multi-step sampling makes inference slow. Prior work has integrated GANs with DMs to speed up inference by approximating the denoising distributions, but adversarial training makes such models difficult to converge. To overcome this, we introduce CM-TTS, a novel architecture grounded in consistency models (CMs). Drawing inspiration from continuous-time diffusion models, CM-TTS achieves high-quality speech synthesis in fewer steps, without adversarial training or dependence on pre-trained models. We further design weighted samplers that incorporate different sampling positions into model training with dynamic probabilities, ensuring unbiased learning throughout the entire training process. We present a real-time consistency model for mel-spectrogram generation, validated through comprehensive evaluations. Experimental results show that CM-TTS outperforms existing single-step speech synthesis systems, representing a significant advancement in the field.
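To make the idea of few-step generation with a consistency model concrete, the following is a minimal sketch on a toy 1-D problem rather than on mel-spectrograms. It is not the CM-TTS implementation. It assumes a Gaussian "data" distribution, for which the consistency function of the probability-flow ODE (with noise level equal to time) has a closed form that stands in for a trained network. The sampling loop follows the general multistep consistency sampling procedure from the consistency-models literature; whether CM-TTS uses exactly this loop is not stated in the abstract. All names and constants (`EPS`, `T_MAX`, `MU`, `SIGMA_DATA`, `consistency_fn`, `multistep_sample`) are hypothetical and chosen for illustration only.

```python
import math
import random

# Assumed noise range; illustrative values, not taken from CM-TTS.
EPS = 0.002      # smallest noise level, where the output equals the input
T_MAX = 80.0     # largest noise level, where samples are close to pure noise

# Toy 1-D "data" distribution N(MU, SIGMA_DATA^2) standing in for mel-spectrograms.
MU = 1.5
SIGMA_DATA = 0.5


def consistency_fn(x: float, t: float) -> float:
    """Map a noisy point x at noise level t to the start of its ODE trajectory.

    For Gaussian data, the probability-flow ODE is
    dx/dt = t * (x - MU) / (SIGMA_DATA^2 + t^2), so (x - MU) scales with
    sqrt(SIGMA_DATA^2 + t^2). In a real system this closed form would be
    replaced by a trained neural network.
    """
    scale = math.sqrt(SIGMA_DATA**2 + EPS**2) / math.sqrt(SIGMA_DATA**2 + t**2)
    return MU + (x - MU) * scale


def multistep_sample(time_points, rng):
    """Generate one sample using len(time_points) evaluations of consistency_fn.

    time_points must be decreasing noise levels, each at least EPS, starting
    at T_MAX. A single entry gives one-step generation.
    """
    # Start from (approximately) pure noise at the largest noise level.
    x = consistency_fn(rng.gauss(0.0, 1.0) * time_points[0], time_points[0])
    for tau in time_points[1:]:
        # Re-noise the current estimate to level tau, then denoise again.
        x_noisy = x + math.sqrt(tau**2 - EPS**2) * rng.gauss(0.0, 1.0)
        x = consistency_fn(x_noisy, tau)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    one_step = [multistep_sample([T_MAX], rng) for _ in range(2000)]
    two_step = [multistep_sample([T_MAX, 1.0], rng) for _ in range(2000)]
    print("one-step sample mean:", sum(one_step) / len(one_step))
    print("two-step sample mean:", sum(two_step) / len(two_step))
```

Passing a single time point gives one-step generation; adding intermediate time points spends more function evaluations, which with a learned (imperfect) consistency function is the usual way to trade compute for sample quality.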