Large-scale diffusion models have shown outstanding generative abilities across multiple modalities, including images, videos, and audio. However, text-to-speech (TTS) systems typically rely on domain-specific modeling factors (e.g., phonemes and phoneme-level durations) to ensure precise temporal alignment between text and speech, which hinders the efficiency and scalability of diffusion models for TTS. In this work, we present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders. Our approach addresses the challenge of text-speech alignment through cross-attention mechanisms combined with prediction of the total length of the speech representation. To achieve this, we enhance the DiT architecture to suit TTS and improve the alignment by incorporating semantic guidance into the latent space of speech. We scale the training dataset and the model size to 82K hours and 790M parameters, respectively. Our extensive experiments demonstrate that a large-scale diffusion model for TTS without domain-specific modeling not only simplifies the training pipeline but also achieves zero-shot performance superior or comparable to that of state-of-the-art TTS models in terms of naturalness, intelligibility, and speaker similarity. Our speech samples are available at https://ditto-tts.github.io.
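To make the alignment idea concrete, the following is a minimal sketch of how speech latents of a predicted total length could attend to text embeddings through cross-attention, without phoneme-level durations. It is an illustration under stated assumptions, not the authors' implementation: the names `cross_attention`, `random_matrix`, `text_len`, and `predicted_len` are hypothetical, the embeddings are random stand-ins for the outputs of pre-trained encoders, and the learned projections, multiple heads, timestep conditioning, and semantic guidance of a real DiT are omitted.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections).

    queries: speech latents, shape (speech_len, dim)
    keys, values: text embeddings, shape (text_len, dim)
    Returns (outputs, weights), where weights[i][j] is how strongly speech
    position i attends to text token j, i.e. a soft text-speech alignment.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        w = softmax(scores)
        out = [sum(w[j] * values[j][d] for j in range(len(values)))
               for d in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def random_matrix(rows, cols, rng):
    """Gaussian stand-in for encoder outputs (assumption, not real features)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 8
text_len = 5        # number of text tokens (hypothetical)
predicted_len = 12  # hypothetical output of a total-length predictor

text_emb = random_matrix(text_len, dim, rng)
noisy_latents = random_matrix(predicted_len, dim, rng)

# Each of the predicted_len speech positions queries all text tokens.
attended, align = cross_attention(noisy_latents, text_emb, text_emb)
print(len(attended), len(attended[0]))  # 12 8
```

In this sketch only the total length of the speech latent sequence is fixed in advance; which text token each speech position draws on is left to the attention weights, which is the role the abstract assigns to cross-attention in place of explicit per-phoneme durations.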