Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video, and it remains challenging to build V2A models with high generation quality, efficiency, and visual-audio temporal synchrony. We propose Frieren, a V2A model based on rectified flow matching. Frieren regresses the conditional transport vector field from noise to spectrogram latent along straight paths and conducts sampling by solving an ODE, outperforming autoregressive and score-based models in terms of audio quality. By employing a non-autoregressive vector field estimator based on a feed-forward transformer, together with channel-level cross-modal feature fusion with strong temporal alignment, our model generates audio that is highly synchronized with the input video. Furthermore, through reflow and one-step distillation with a guided vector field, our model can generate decent audio in a few sampling steps, or even a single one. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment on VGGSound, with alignment accuracy reaching 97.22% and a 6.2% improvement in inception score over the strong diffusion-based baseline. Audio samples are available at http://frieren-v2a.github.io.
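The core idea of rectified flow matching described above can be sketched in a few lines: regress a vector field toward the straight-path transport direction, then sample by integrating the resulting ODE with Euler steps. This is a minimal illustrative sketch, not the paper's implementation; the names `rf_loss`, `euler_sample`, and the `oracle` field are hypothetical, and the toy vector is a stand-in for the spectrogram latent.

```python
import numpy as np

def rf_loss(model, x1, x0, t, cond):
    # Rectified flow matching loss: at the linear interpolant
    # x_t = (1 - t) * x0 + t * x1, the regression target is the
    # straight-path transport direction x1 - x0.
    xt = (1 - t) * x0 + t * x1
    target = x1 - x0
    pred = model(xt, t, cond)
    return np.mean((pred - target) ** 2)

def euler_sample(model, x0, cond, steps=1):
    # Sample by solving dx/dt = v(x, t, cond) from noise (t=0) to
    # data (t=1); straight paths are why very few steps can suffice.
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * model(x, t, cond)
    return x

# Toy demonstration with a single "data" point: the ideal field along
# straight paths is (x1 - x) / (1 - t), which is constant on each
# trajectory, so Euler integration recovers x1 exactly in one step.
x1 = np.array([1.0, -2.0, 3.0])           # toy data (latent stand-in)
oracle = lambda x, t, cond: (x1 - x) / (1 - t)
x0 = np.random.default_rng(0).standard_normal(3)  # noise sample
out = euler_sample(oracle, x0, None, steps=1)
```

Because the ideal straight-path field is constant along each trajectory, the one-step result already matches multi-step integration here; in practice, reflow and distillation are what push a learned field toward this property.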