SF-Speech: Straightened Flow for Zero-Shot Voice Clone

Recently, neural ordinary differential equation (ODE) models trained with flow matching have achieved impressive performance on the zero-shot voice cloning task. However, taking standard Gaussian noise as the initial distribution of the ODE causes the regression targets of flow matching to intersect in many places, which makes the model harder to train and increases the curvature of the learned generation trajectories. These curved trajectories limit the ability of ODE models to generate good samples in only a few solver steps. This paper proposes SF-Speech, a novel voice cloning model based on ODEs and in-context learning. Unlike previous work, SF-Speech adopts a lightweight multi-stage module to generate a more deterministic initial distribution for the ODE. Without introducing any additional loss function, we effectively straighten the curved reverse trajectories of the ODE model by training it jointly with the proposed module. Experimental results on datasets of various scales show that SF-Speech outperforms state-of-the-art zero-shot TTS methods while requiring only a quarter of the solver steps, resulting in a generation speed approximately 3.7 times that of Voicebox and E2 TTS. Audio samples are available at the demo page: https://lixuyuan102.github.io/Demo/
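The link between trajectory curvature and solver steps can be illustrated with a toy example. The sketch below is not SF-Speech or any part of its training procedure; it is a minimal two-dimensional illustration, assuming a hand-written velocity field in place of a learned model. The names `euler`, `straight_field`, `curved_field`, and `endpoint_error` are hypothetical and chosen only for this example. Both fields carry the point (1, 0) to (0, 1) over t in [0, 1]: one along a straight line at constant velocity, the other along a quarter circle.

```python
import math

START, TARGET = (1.0, 0.0), (0.0, 1.0)


def euler(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / n_steps
    for i in range(n_steps):
        v = velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def straight_field(x, t):
    # Constant velocity: the trajectory is the straight segment START -> TARGET.
    return [TARGET[0] - START[0], TARGET[1] - START[1]]


def curved_field(x, t):
    # Rotation by a quarter turn over unit time: the exact trajectory is an arc.
    w = math.pi / 2
    return [-w * x[1], w * x[0]]


def endpoint_error(velocity, n_steps):
    """Distance between the Euler endpoint and the true endpoint TARGET."""
    return math.dist(euler(velocity, START, n_steps), TARGET)


if __name__ == "__main__":
    for n in (1, 8, 32):
        print(n, endpoint_error(straight_field, n), endpoint_error(curved_field, n))
```

For the straight field a single Euler step already lands on the target, because the velocity never changes along the path. For the curved field the endpoint error shrinks only as the step count grows, which is the sense in which straighter trajectories allow generation with fewer solver steps.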

