Effectively extracting inter-frame motion and appearance information is important for video frame interpolation (VFI). Previous works either extract both types of information in a mixed way or design separate modules for each type, which leads to representation ambiguity and low efficiency. In this paper, we propose a novel module that explicitly extracts motion and appearance information via a unifying operation. Specifically, we rethink the information processing in inter-frame attention and reuse its attention map for both appearance feature enhancement and motion information extraction. Furthermore, for efficient VFI, our proposed module can be seamlessly integrated into a hybrid CNN and Transformer architecture. This hybrid pipeline reduces the computational complexity of inter-frame attention while preserving detailed low-level structure information. Experimental results demonstrate that, for both fixed- and arbitrary-timestep interpolation, our method achieves state-of-the-art performance on various datasets. Meanwhile, our approach incurs lower computational overhead than models with comparable performance. The source code and models are available at https://github.com/MCG-NJU/EMA-VFI.
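
The idea of reusing one inter-frame attention map for both purposes can be illustrated with a minimal NumPy sketch. This is not the released implementation: the function names (`inter_frame_attention`, `coordinate_grid`), the identity query/key/value projections, the single attention head, global (non-windowed) attention, and the reading of "motion" as the attention-weighted expected displacement of pixel coordinates are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def coordinate_grid(h, w):
    """Return (h*w, 2) array of (x, y) pixel coordinates in row-major order."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(np.float64)


def inter_frame_attention(feat0, feat1, h, w):
    """Hypothetical sketch: one attention map, two uses.

    feat0, feat1: (h*w, c) token features of the two input frames.
    Returns the attention map, appearance-enhanced features for frame 0,
    and a per-token motion estimate (dx, dy) from frame 0 to frame 1.
    """
    n, c = feat0.shape
    # Assumption: identity projections instead of learned Q/K/V weights.
    q, k, v = feat0, feat1, feat1
    attn = softmax(q @ k.T / np.sqrt(c), axis=-1)  # (n, n), rows sum to 1

    # Use 1: appearance enhancement, aggregate frame-1 features into frame 0.
    appearance = feat0 + attn @ v

    # Use 2: motion, the expected matched position minus the own position.
    coords = coordinate_grid(h, w)
    motion = attn @ coords - coords
    return attn, appearance, motion


# Toy example: every token has a unique feature; frame 1 is frame 0
# shifted one pixel to the right (np.roll wraps around at the border).
h, w = 4, 4
n = h * w
feat0 = 40.0 * np.eye(n)
feat1 = np.roll(feat0.reshape(h, w, n), shift=1, axis=1).reshape(n, n)

attn, appearance, motion = inter_frame_attention(feat0, feat1, h, w)
interior_motion = motion.reshape(h, w, 2)[:, : w - 1]  # skip wrapped column
```

In this toy setup the attention is nearly one-hot, so `interior_motion` recovers a displacement of about (1, 0) for every token that does not wrap around the border, while `appearance` reuses the same `attn` matrix without any second matching step.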