Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video, and it remains challenging to build V2A models with high generation quality, efficiency, and visual-audio temporal synchrony. We propose Frieren, a V2A model based on rectified flow matching. Frieren regresses the conditional transport vector field from noise to spectrogram latent along straight paths and conducts sampling by solving an ODE, outperforming autoregressive and score-based models in terms of audio quality. By employing a non-autoregressive vector field estimator based on a feed-forward transformer, together with channel-level cross-modal feature fusion with strong temporal alignment, our model generates audio that is highly synchronized with the input video. Furthermore, through reflow and one-step distillation with a guided vector field, our model can generate decent audio in a few sampling steps, or even a single one. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment on VGGSound, with alignment accuracy reaching 97.22% and a 6.2% improvement in inception score over the strong diffusion-based baseline. Audio samples are available at http://frieren-v2a.github.io.
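The core idea of rectified flow matching described above can be sketched in a few lines: regress a vector field toward the straight-path transport direction, then sample by integrating the resulting ODE with Euler steps. This is a minimal illustrative sketch, not the paper's implementation; the names `rf_loss`, `euler_sample`, and the `oracle` field are hypothetical, and the toy vector is a stand-in for the spectrogram latent.

```python
import numpy as np

def rf_loss(model, x1, x0, t, cond):
    # Rectified flow matching loss: at the linear interpolant
    # x_t = (1 - t) * x0 + t * x1, the regression target is the
    # straight-path transport direction x1 - x0.
    xt = (1 - t) * x0 + t * x1
    target = x1 - x0
    pred = model(xt, t, cond)
    return np.mean((pred - target) ** 2)

def euler_sample(model, x0, cond, steps=1):
    # Sample by solving dx/dt = v(x, t, cond) from noise (t=0) to
    # data (t=1); straight paths are why very few steps can suffice.
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * model(x, t, cond)
    return x

# Toy demonstration with a single "data" point: the ideal field along
# straight paths is (x1 - x) / (1 - t), which is constant on each
# trajectory, so Euler integration recovers x1 exactly in one step.
x1 = np.array([1.0, -2.0, 3.0])           # toy data (latent stand-in)
oracle = lambda x, t, cond: (x1 - x) / (1 - t)
x0 = np.random.default_rng(0).standard_normal(3)  # noise sample
out = euler_sample(oracle, x0, None, steps=1)
```

Because the ideal straight-path field is constant along each trajectory, the one-step result already matches multi-step integration here; in practice, reflow and distillation are what push a learned field toward this property.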