Diffusion-based generative AI has gained significant attention for its superior performance over other generative techniques such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). While diffusion models have achieved notable advances in fields such as computer vision and natural language processing, their application to speech generation remains under-explored. Mainstream Text-to-Speech (TTS) systems primarily map outputs to Mel-Spectrograms (MelSpecs) in the spectral space, leading to high computational loads due to the sparsity of MelSpecs. To address these limitations, we propose LatentSpeech, a novel TTS generation approach based on latent diffusion models. By using latent embeddings as the intermediate representation, LatentSpeech reduces the target dimension to 5% of that required for MelSpecs, simplifying the processing for the TTS encoder and vocoder and enabling efficient, high-quality speech generation. This study marks the first integration of latent diffusion models into TTS, enhancing the accuracy and naturalness of generated speech. Experimental results on benchmark datasets demonstrate that LatentSpeech achieves a 25% improvement in Word Error Rate (WER) and a 24% improvement in Mel Cepstral Distortion (MCD) compared to existing models, with these improvements rising to 49.5% and 26%, respectively, when additional training data is used. These findings highlight the potential of LatentSpeech to advance the state of the art in TTS technology.