Recently, neural ordinary differential equation (ODE) models trained with flow matching have achieved impressive performance on the zero-shot voice cloning task. Nevertheless, postulating standard Gaussian noise as the initial distribution of the ODE gives rise to numerous intersections among the targets fitted by flow matching, which complicates model training and increases the curvature of the learned generation trajectories. These curved trajectories restrict the ability of ODE models to generate desirable samples in a few steps. This paper proposes SF-Speech, a novel voice cloning model based on ODEs and in-context learning. Unlike previous works, SF-Speech adopts a lightweight multi-stage module to generate a more deterministic initial distribution for the ODE. Without introducing any additional loss function, we effectively straighten the curved reverse trajectories of the ODE model by training it jointly with the proposed module. Experimental results on datasets of various scales show that SF-Speech outperforms state-of-the-art zero-shot TTS methods while requiring only a quarter of the solver steps, yielding a generation speed approximately 3.7 times that of Voicebox and E2 TTS. Audio samples are available at the demo page\footnote{[Online] Available: https://lixuyuan102.github.io/Demo/}.