U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech

Deep learning has led to considerable advances in text-to-speech synthesis. Most recently, Score-based Generative Models (SGMs), also known as Diffusion Probabilistic Models (DPMs), have gained traction due to their ability to produce high-quality speech in neural speech synthesis systems. In SGMs, the U-Net architecture and its variants have dominated as the backbone since their first successful adoption. In this research, we focus on the neural network in diffusion-model-based Text-to-Speech (TTS) systems and propose the U-DiT architecture, exploring the potential of the vision transformer (ViT) architecture as the core component of the diffusion model in a TTS system. The modular design of the U-DiT architecture, which combines the strengths of U-Net and ViT, allows for great scalability and versatility across different data scales. The proposed U-DiT TTS system is a mel spectrogram-based acoustic model and uses a pretrained HiFi-GAN as the vocoder. The objective (i.e., Fréchet distance) and MOS results show that our U-DiT TTS system achieves state-of-the-art performance on the single-speaker dataset LJSpeech. Our demos are publicly available at: https://eihw.github.io/u-dit-tts/
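The abstract names the Fréchet distance as its objective metric but does not say which features it is computed on or how. As a rough illustration only, the sketch below fits a Gaussian to each of two feature sets and evaluates the squared Fréchet distance between them. It assumes diagonal covariances, which reduces the general form ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) to a per-dimension sum. The function name `frechet_distance_diag` and the toy feature vectors are hypothetical and are not taken from the paper.

```python
import math
from statistics import fmean, pvariance


def frechet_distance_diag(real, generated):
    """Squared Frechet distance between diagonal Gaussians fitted to two sets.

    real, generated: non-empty lists of equal-length feature vectors.
    Assumption (not from the paper): covariances are diagonal, so the trace
    term becomes sum_d (sqrt(var_r[d]) - sqrt(var_g[d])) ** 2.
    """
    dims = len(real[0])
    total = 0.0
    for d in range(dims):
        # Collect the d-th feature across all vectors in each set.
        r = [vec[d] for vec in real]
        g = [vec[d] for vec in generated]
        mu_r, mu_g = fmean(r), fmean(g)
        # Population variance around the fitted mean.
        var_r, var_g = pvariance(r, mu_r), pvariance(g, mu_g)
        total += (mu_r - mu_g) ** 2
        total += (math.sqrt(var_r) - math.sqrt(var_g)) ** 2
    return total


if __name__ == "__main__":
    # Toy 2-D features; real systems would use embeddings of real and synthesized audio.
    real_feats = [[0.0, 1.0], [2.0, 3.0]]
    fake_feats = [[1.0, 2.0], [3.0, 4.0]]
    print(frechet_distance_diag(real_feats, fake_feats))
```

A lower value means the generated feature distribution is closer to the real one. A full implementation would use the complete covariance matrices and a matrix square root rather than the diagonal shortcut assumed here.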


Related content

Speech synthesis

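Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on multiple disciplines, including artificial intelligence, psychology, acoustics, linguistics, digital signal processing, and computer science, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has evolved from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks, hospitals, and similar settings, in car navigation systems, and in automated-answering call centers, yielding substantial economic benefits. In addition, with the proliferation of devices closely tied to daily life, such as smartphones, MP3 players, and PDAs, applications of speech synthesis are extending into entertainment, speech-based teaching, and rehabilitation therapy. Speech synthesis can be said to be influencing every aspect of people's lives.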