Diffusion models operating on VAE latents or mel-spectrograms have recently become the dominant paradigm for zero-shot TTS. Although these compressed representations improve generation efficiency, they incur information loss and prevent fully end-to-end training. In principle, modeling raw waveforms directly avoids both issues; however, this direction remains underexplored and is often considered difficult because audio signals form extremely long sequences. To address this, we propose WavTTS, the first raw-waveform generative TTS model that substantially narrows the gap with latent-space generative models. Built on flow matching with a Diffusion Transformer (DiT), WavTTS models speech waveforms directly through a simple patchification strategy and integrates multi-scale mel-spectrogram supervision to provide perceptual guidance during training. We further investigate the impact of prediction targets and noise scheduling in waveform diffusion, and develop a schedule design that improves generation quality. Evaluations on open-source benchmarks show that WavTTS closely approaches the performance of current state-of-the-art latent generative zero-shot TTS models while substantially outperforming previous end-to-end speech generation models. Our findings demonstrate the feasibility of scaling diffusion-based TTS directly in the waveform space, opening a new direction for end-to-end speech generation.
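To make the two core ingredients named above concrete, the following is a minimal NumPy sketch of waveform patchification and of building a flow matching training pair. It is an illustration under stated assumptions, not the WavTTS implementation: the patch size of 256, the 24 kHz sample rate, the zero-padding of the tail, and the linear interpolation path with velocity target are all assumed here, and the function names `patchify`, `unpatchify`, and `flow_matching_pair` are hypothetical. The DiT network and the multi-scale mel-spectrogram loss are omitted.

```python
import numpy as np


def patchify(wave: np.ndarray, patch_size: int) -> np.ndarray:
    """Split a 1-D waveform into non-overlapping patches (one token per patch).

    The tail is zero-padded so the length becomes a multiple of patch_size.
    (Zero-padding is an assumption made for this sketch.)
    """
    pad = (-len(wave)) % patch_size
    padded = np.pad(wave, (0, pad))
    return padded.reshape(-1, patch_size)


def unpatchify(patches: np.ndarray, length: int) -> np.ndarray:
    """Inverse of patchify: flatten the patches and drop the padding."""
    return patches.reshape(-1)[:length]


def flow_matching_pair(x1: np.ndarray, t: float, rng: np.random.Generator):
    """Build (x_t, target velocity) for a linear-path flow matching objective.

    Assumed path: x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I),
    so the regression target is the constant velocity x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)
    xt = (1.0 - t) * x0 + t * x1
    velocity = x1 - x0
    return xt, velocity


rng = np.random.default_rng(0)

# One second of a 220 Hz sine at an assumed 24 kHz sample rate.
sample_rate = 24_000
wave = np.sin(2 * np.pi * 220 * np.arange(sample_rate) / sample_rate).astype(np.float32)

# 24,000 samples with patch size 256 -> 94 tokens instead of 24,000 time steps.
tokens = patchify(wave, patch_size=256)
t = 0.3
xt, velocity = flow_matching_pair(tokens, t=t, rng=rng)

print(tokens.shape)
```

In a training loop, a model would take `xt`, `t`, and the text and speaker conditioning, and regress toward `velocity`. The sketch shows why patchification matters for sequence length: the Transformer attends over patches rather than individual samples.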