Deep learning-based video frame interpolation (VFI) methods have predominantly focused on estimating motion vectors between two input frames and warping the frames to the target time. While this approach performs well for linear motion between the two input frames, it is limited when dealing with occlusions and nonlinear movements. Recently, generative models have been applied to VFI to address these issues. However, because VFI aims to predict accurate intermediate frames between two given frames rather than to generate plausible images, performance limitations persist. In this paper, we propose a multi-in-single-out (MISO) based VFI method that does not rely on motion vector estimation, allowing it to effectively model occlusions and nonlinear motion. Additionally, we introduce a novel motion perceptual loss that enables MISO-VFI to better capture the spatio-temporal correlations within the video frames. Our MISO-VFI method achieves state-of-the-art results on the VFI benchmarks Vimeo90K, Middlebury, and UCF101, outperforming existing approaches by a significant margin.