Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods rely heavily on optical flow estimation and feature warping, resulting in substantial computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state-update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF introduces a lightweight SSM-based fusion module that replaces conventional flow-guided alignment with a spatio-temporal bidirectional scanning mechanism, enabling efficient information aggregation across frames. Extensive experiments on multiple benchmarks demonstrate that MambaVF achieves state-of-the-art performance on multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. Notably, MambaVF is highly efficient, reducing parameters by up to 92.25% and FLOPs by 88.79% while achieving a 2.1x speedup over existing methods. Project page: https://mambavf.github.io
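The sequential state update and bidirectional scanning described above can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification under stated assumptions: the function names (`ssm_scan`, `bidirectional_fuse`), the fixed linear recurrence (no input-dependent selectivity), and the averaging of the two scan directions are illustrative choices, not the paper's actual architecture.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """One-directional SSM pass: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    x: (T, d_in) per-frame features; A: (d_state, d_state);
    B: (d_state, d_in); C: (d_out, d_state). Linear in T: one
    state update per frame, no pairwise motion estimation.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t   # recurrent state update
        ys.append(C @ h)      # per-frame readout
    return np.stack(ys)       # (T, d_out)

def bidirectional_fuse(frames, A, B, C):
    """Hypothetical bidirectional temporal scan: run the recurrence
    forward and backward over the frame sequence, then merge (here,
    by simple averaging) so each output sees both past and future."""
    fwd = ssm_scan(frames, A, B, C)
    bwd = ssm_scan(frames[::-1], A, B, C)[::-1]
    return 0.5 * (fwd + bwd)
```

Because the same recurrence is applied in both directions, the output is equivariant to time reversal: fusing a reversed sequence gives the reversed fusion result, which is one reason bidirectional scanning can stand in for explicit frame alignment in this toy setting.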