In text-to-speech (TTS) synthesis, diffusion models have achieved promising generation quality. However, because their diffusion process is a pre-defined data-to-noise mapping, their prior distribution is restricted to a noisy representation that carries little information about the generation target. In this work, we present a novel TTS system, Bridge-TTS, which makes the first attempt to replace the noisy Gaussian prior of established diffusion-based TTS methods with a clean, deterministic one that provides strong structural information about the target. Specifically, we use the latent representation obtained from the text input as our prior and build a fully tractable Schrödinger bridge between it and the ground-truth mel-spectrogram, yielding a data-to-data process. Moreover, the tractability and flexibility of our formulation allow us to empirically study design choices such as noise schedules, and to develop both stochastic and deterministic samplers. Experimental results on the LJ-Speech dataset demonstrate the effectiveness of our method in terms of both synthesis quality and sampling efficiency: it significantly outperforms its diffusion counterpart Grad-TTS in 50-step and 1000-step synthesis, and outperforms strong fast TTS models in few-step scenarios. Project page: https://bridge-tts.github.io/
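To make the data-to-data idea concrete, the following is a minimal sketch of sampling an intermediate state between a clean prior and a target mel-spectrogram. It assumes the simplest possible case, a Brownian bridge with constant diffusion coefficient `g`, whose marginal at time `t` is Gaussian with mean `(1 - t) * x0 + t * x1` and variance `g**2 * t * (1 - t)`. This is an illustration only, not the noise schedules studied in the paper; the function name `sample_brownian_bridge` and the random stand-in arrays are hypothetical.

```python
import numpy as np


def sample_brownian_bridge(x0, x1, t, g=1.0, rng=None):
    """Draw x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    x0 : data endpoint (here, a stand-in for the ground-truth mel-spectrogram)
    x1 : prior endpoint (here, a stand-in for the text-derived latent)
    t  : time in [0, 1]
    g  : constant diffusion coefficient (an assumption of this sketch)
    """
    rng = np.random.default_rng() if rng is None else rng
    mean = (1.0 - t) * x0 + t * x1        # linear interpolation of the endpoints
    std = g * np.sqrt(t * (1.0 - t))      # vanishes at both endpoints
    return mean + std * rng.standard_normal(x0.shape)


rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 100))                   # placeholder "ground-truth" mel
prior = mel + 0.5 * rng.standard_normal((80, 100))     # placeholder text latent
x_mid = sample_brownian_bridge(mel, prior, 0.5, rng=rng)
```

Unlike a data-to-noise process, both endpoints here are deterministic: the noise level is zero at `t = 0` and `t = 1` and largest in between, so the state at `t = 1` is exactly the prior rather than a Gaussian sample.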