Inter-frame modeling is pivotal to generating intermediate frames in video frame interpolation (VFI). Current approaches rely predominantly on convolution- or attention-based models, which tend either to lack a sufficient receptive field or to incur significant computational overhead. Selective State Space Models (S6), recently introduced for long-sequence modeling, offer both linear complexity and data-dependent modeling capability. In this paper, we propose VFIMamba, a novel frame interpolation method that harnesses the S6 model for efficient and dynamic inter-frame modeling. Our approach introduces the Mixed-SSM Block (MSB), which first rearranges tokens from adjacent frames in an interleaved fashion and then applies multi-directional S6 modeling. This design enables efficient transmission of information across frames while maintaining linear complexity. Furthermore, we introduce a novel curriculum learning strategy that progressively builds the model's ability to capture inter-frame dynamics across varying motion magnitudes, fully unleashing the potential of the S6 model. Experiments show that our method attains state-of-the-art performance across diverse benchmarks and excels in high-resolution scenarios. In particular, on the X-TEST dataset, VFIMamba achieves an improvement of 0.80 dB for 4K frames and 0.96 dB for 2K frames.
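The interleaved rearrangement and multi-directional scanning described for the MSB can be illustrated with a minimal sketch. It is not the authors' implementation: the column-wise interleaving pattern, the choice of four scan directions, and the helper names `interleave_tokens` and `scan_orders` are assumptions made for illustration, and the S6 model that would consume each sequence is omitted.

```python
def interleave_tokens(frame0, frame1):
    """Interleave two H x W token grids column-wise into one H x 2W grid.

    Assumed layout (hypothetical): tokens at the same spatial position in
    the two frames become direct neighbours in each row.
    """
    same_rows = len(frame0) == len(frame1)
    if not same_rows or any(len(a) != len(b) for a, b in zip(frame0, frame1)):
        raise ValueError("both frames must have the same grid shape")
    mixed = []
    for row0, row1 in zip(frame0, frame1):
        mixed_row = []
        for tok0, tok1 in zip(row0, row1):
            mixed_row.extend([tok0, tok1])
        mixed.append(mixed_row)
    return mixed


def scan_orders(grid):
    """Flatten a token grid into four 1-D sequences (assumed directions):
    row-major, reversed row-major, column-major, reversed column-major."""
    row_major = [tok for row in grid for tok in row]
    col_major = [tok for col in zip(*grid) for tok in col]
    return [row_major, row_major[::-1], col_major, col_major[::-1]]


# Toy 2 x 2 token grids; strings stand in for feature vectors.
frame0 = [["a0", "a1"], ["a2", "a3"]]
frame1 = [["b0", "b1"], ["b2", "b3"]]
mixed = interleave_tokens(frame0, frame1)
sequences = scan_orders(mixed)
```

In this sketch each sequence has length 2HW, so a model whose cost is linear in sequence length stays linear in the number of tokens from both frames, while co-located tokens from the two frames sit next to each other in the row-major scans.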