We propose Shallow Flow Matching (SFM), a novel mechanism that enhances flow matching (FM)-based text-to-speech (TTS) models within a coarse-to-fine generation paradigm. Unlike conventional FM modules, which use the coarse representations from the weak generator as conditions, SFM uses these representations to construct intermediate states along the FM paths. During training, we introduce an orthogonal projection method that adaptively determines the temporal position of these states, and we apply a principled construction strategy based on a single-segment piecewise flow. At inference, SFM starts from the intermediate state rather than from pure noise, thereby focusing computation on the later stages of the FM paths. We integrate SFM into multiple TTS models via a lightweight SFM head. Experiments demonstrate that SFM yields consistent gains in speech naturalness in both objective and subjective evaluations and significantly accelerates inference when adaptive-step ODE solvers are used. Demo and code are available at https://ydqmkkx.github.io/SFMDemo/.
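To make the idea of starting from an intermediate state concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a linear FM path x_t = (1 - t) x_0 + t x_1, estimates the temporal position of a coarse representation by orthogonally projecting it onto the target, and then integrates only the remaining part of the path. The function names, the projection formula, the way the intermediate state is formed, and the oracle velocity field are all illustrative assumptions.

```python
import numpy as np


def project_time(coarse, target, eps=1e-8):
    """Assumed rule: scalar orthogonal projection of `coarse` onto `target`,
    clipped to [0, 1], used as the temporal position on the path."""
    t = float(np.vdot(coarse, target) / (np.vdot(target, target) + eps))
    return min(max(t, 0.0), 1.0)


def intermediate_state(coarse, t, rng):
    """Assumed construction: treat `coarse` as a stand-in for t * x_1 and add
    the noise share (1 - t) * x_0 of a linear path, with x_0 ~ N(0, I)."""
    noise = rng.standard_normal(coarse.shape)
    return (1.0 - t) * noise + coarse


def integrate_from(x, t_start, velocity, n_steps=10):
    """Fixed-step Euler integration from t_start (< 1) to 1, i.e. only the
    later part of the path is solved."""
    ts = np.linspace(t_start, 1.0, n_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * velocity(x, t0)
    return x


# Toy data: `perp` is orthogonal to `target`, so the projection recovers 0.6.
target = np.array([1.0, 2.0, 3.0, 4.0])
perp = np.array([2.0, -1.0, 4.0, -3.0])
coarse = 0.6 * target + 0.1 * perp

rng = np.random.default_rng(0)
t_g = project_time(coarse, target)
x_mid = intermediate_state(coarse, t_g, rng)

# Oracle velocity of the linear path towards a known target (toy stand-in for
# a learned network, which would not have access to `target`).
oracle = lambda x, t: (target - x) / (1.0 - t)
x_out = integrate_from(x_mid, t_g, oracle, n_steps=5)
```

In this toy setting the solver covers only the interval from `t_g` to 1 instead of the full interval from 0 to 1, which is the intuition behind the reported speed-up with adaptive-step solvers. A real system would replace `oracle` with a trained velocity network and would predict the temporal position without access to the target.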