Video frame interpolation (VFI) aims to synthesize intermediate frames between existing frames to enhance visual smoothness and quality. Beyond conventional methods based on the reconstruction loss, recent works employ high-quality generative models for perceptual quality. However, they require complex training and large computational cost for modeling in the pixel space. In this paper, we introduce Disentangled Motion Modeling (MoMo), a diffusion-based approach for VFI that enhances visual quality by focusing on intermediate motion modeling. We propose a disentangled two-stage training process: we first train a frame synthesis model to generate frames from input pairs and their optical flows, and then train a motion diffusion model, equipped with our novel diffusion U-Net architecture designed for optical flow, to produce bi-directional flows between frames. By leveraging the simpler, low-frequency representation of motion, our method achieves superior perceptual quality with reduced computational demands compared to generative modeling in the pixel space. It surpasses state-of-the-art methods in perceptual metrics across various benchmarks, demonstrating its efficacy and efficiency in VFI. Our code is available at: https://github.com/JHLew/MoMo
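The two-stage pipeline described above can be sketched at inference time as follows. This is a minimal, heavily simplified illustration, not the paper's implementation: the `motion_diffusion` stand-in below merely shrinks random noise toward zero flow, whereas the actual model runs a flow-specific diffusion U-Net to predict bi-directional flows, and `synthesize` replaces the learned frame synthesis model with nearest-neighbor backward warping and averaging. All function names are hypothetical.

```python
import numpy as np

def backward_warp(frame, flow):
    """Sample `frame` at positions displaced by `flow` (nearest-neighbor)."""
    H, W, _ = frame.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return frame[src_y, src_x]

def motion_diffusion(frame0, frame1, steps=4):
    """Toy stand-in for stage 2: iteratively 'denoise' a random flow field.

    The real model conditions a diffusion U-Net on the input frame pair and
    predicts the bi-directional flows (t->0 and t->1) jointly.
    """
    H, W, _ = frame0.shape
    flow = np.random.randn(H, W, 2)
    for _ in range(steps):
        flow = flow * 0.5  # placeholder for one reverse-diffusion step
    return flow, flow  # (flow_t0, flow_t1)

def synthesize(frame0, frame1, flow_t0, flow_t1):
    """Toy stand-in for stage 1: warp both inputs to time t and blend."""
    w0 = backward_warp(frame0, flow_t0)
    w1 = backward_warp(frame1, flow_t1)
    return 0.5 * w0 + 0.5 * w1

# Usage: interpolate a middle frame between two toy 8x8 RGB frames.
frame0 = np.zeros((8, 8, 3))
frame1 = np.ones((8, 8, 3))
flow_t0, flow_t1 = motion_diffusion(frame0, frame1)
mid = synthesize(frame0, frame1, flow_t0, flow_t1)
```

The key point the sketch mirrors is the disentanglement: motion estimation (the diffusion model, operating on the low-frequency flow representation) is separated from appearance generation (the synthesis model), so the expensive generative modeling never happens directly in pixel space.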