With the advancement of AI-generated content (AIGC), video frame interpolation (VFI) has become a crucial component of existing video generation frameworks, attracting widespread research interest. For the VFI task, motion estimation between neighboring frames plays an essential role in avoiding motion ambiguity. However, existing VFI methods often struggle to accurately predict the motion between consecutive frames, and this imprecise estimation leads to blurred and visually incoherent interpolated frames. In this paper, we propose a novel diffusion framework, motion-aware latent diffusion models (MADiff), designed specifically for the VFI task. By incorporating motion priors between the conditioning neighboring frames and the target interpolated frame predicted throughout the diffusion sampling procedure, MADiff progressively refines the intermediate outcomes, culminating in results that are both visually smooth and realistic. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance, significantly outperforming existing approaches, especially in challenging scenarios involving dynamic textures with complex motion.
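The core idea of the sampling procedure can be illustrated with a minimal toy sketch: at each denoising step, motion priors are re-estimated between the two conditioning frames and the current intermediate prediction, and the result conditions the next update. Everything below is an assumption for illustration only — `estimate_motion` stands in for a learned flow estimator, `denoise_step` for the actual latent diffusion model, and the toy "latent" is a plain NumPy array; none of this reflects MADiff's real architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_motion(a, b):
    # Hypothetical motion prior: a per-pixel difference standing in
    # for an optical-flow estimate between two frames.
    return b - a

def denoise_step(x, t, m0, m1, frame0, frame1):
    # Hypothetical denoiser: pulls the noisy latent toward a target built
    # from the conditioning frames, nudged by the motion priors. A real
    # model would use a learned network conditioned on these inputs.
    target = 0.5 * (frame0 + frame1) + 0.1 * (m0 + m1)
    return x + (target - x) / (t + 1)

def sample(frame0, frame1, steps=50):
    # Start from Gaussian noise in the (toy) latent space.
    x = rng.standard_normal(frame0.shape)
    for t in reversed(range(steps)):
        # Re-estimate motion priors between each conditioning frame and
        # the current intermediate prediction, mirroring the idea of
        # refining motion throughout the sampling procedure.
        m0 = estimate_motion(frame0, x)
        m1 = estimate_motion(x, frame1)
        x = denoise_step(x, t, m0, m1, frame0, frame1)
    return x
```

The point of the sketch is the loop structure: the motion estimate is not computed once up front but updated at every sampling step from the evolving prediction, which is what lets the intermediate result be progressively refined.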