Deep learning has led to considerable advances in text-to-speech synthesis. Most recently, Score-based Generative Models (SGMs), also known as Diffusion Probabilistic Models (DPMs), have gained traction in neural speech synthesis because they produce high-quality synthesized speech. In SGMs, the U-Net architecture and its variants have dominated as the backbone since their first successful adoption. In this research, we focus on the neural network in diffusion-model-based Text-to-Speech (TTS) systems and propose the U-DiT architecture, exploring the potential of the Vision Transformer (ViT) architecture as the core component of the diffusion model in a TTS system. The modular design of U-DiT, which inherits the best parts of U-Net and ViT, allows for great scalability and versatility across different data scales. The proposed U-DiT TTS system is a mel-spectrogram-based acoustic model and uses a pretrained HiFi-GAN as the vocoder. The objective (i.e., Fréchet distance) and Mean Opinion Score (MOS) results show that our U-DiT TTS system achieves state-of-the-art performance on the single-speaker dataset LJSpeech. Our demos are publicly available at: https://eihw.github.io/u-dit-tts/
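The abstract names the Fréchet distance as its objective metric but does not say which feature extractor or covariance estimate is used. As a rough illustration only, the sketch below computes the squared Fréchet distance between two sets of feature vectors under a simplifying assumption that each set is Gaussian with a diagonal covariance, in which case the distance reduces to the squared difference of the means plus the squared difference of the per-dimension standard deviations. The function names (`gaussian_stats`, `frechet_distance_diag`) and the toy inputs are hypothetical and are not taken from the paper.

```python
import math
from statistics import fmean, pvariance


def gaussian_stats(features):
    """Return per-dimension means and population variances of feature vectors."""
    columns = list(zip(*features))  # transpose: one tuple per feature dimension
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    return means, variances


def frechet_distance_diag(real_features, synth_features):
    """Squared Frechet distance between two diagonal-covariance Gaussians.

    ASSUMPTION: covariances are treated as diagonal, so the trace term
    Tr(S1 + S2 - 2 (S1 S2)^(1/2)) reduces to sum((sigma1 - sigma2)^2).
    A full-covariance version would need a matrix square root instead.
    """
    mu_r, var_r = gaussian_stats(real_features)
    mu_s, var_s = gaussian_stats(synth_features)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_s))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_r, var_s))
    return mean_term + cov_term


# Toy 2-dimensional "embeddings"; real use would take features of recorded
# and synthesized speech from some pretrained audio model.
real = [[0.0, 1.0], [2.0, 3.0]]
synth = [[1.0, 2.0], [3.0, 4.0]]
score = frechet_distance_diag(real, synth)
```

A lower score means the synthesized feature distribution sits closer to the reference one; identical sets give zero. The diagonal shortcut ignores correlations between feature dimensions, so it should be read as a teaching aid rather than a drop-in replacement for the metric reported in the paper.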