Mamba-based Spatio-Frequency Motion Perception for Video Camouflaged Object Detection

Existing video camouflaged object detection (VCOD) methods primarily rely on spatial appearance features for motion perception. However, the high foreground-background similarity in VCOD limits the discriminability of such features (e.g., color and texture). Recent studies demonstrate that frequency features can not only compensate for the limitations of appearance features, but also perceive motion through dynamic variations in spectral energy. Meanwhile, the emerging state space model Mamba enables efficient motion perception in frame sequences thanks to its linear-time long-sequence modeling capability. Motivated by this, we propose Vcamba, a visual camouflage Mamba based on spatio-frequency motion perception that integrates frequency and spatial features for efficient and accurate VCOD. Specifically, by analyzing the spatial representations of frequency components, we reveal a structural evolution pattern that emerges from the ordered superposition of components. Based on this observation, we propose a unique frequency-domain sequential scanning (FSS) strategy to unfold the spectrum. Utilizing FSS, the adaptive frequency enhancement (AFE) module employs Mamba to model the causal dependencies within sequences, enabling effective frequency learning. Furthermore, we propose a space-based long-range motion perception (SLMP) module and a frequency-based long-range motion perception (FLMP) module to model spatio-temporal and frequency-temporal sequences, respectively. Finally, the space and frequency motion fusion (SFMF) module integrates the dual-domain features into a unified motion representation. Experiments show that Vcamba outperforms state-of-the-art methods across 6 evaluation metrics on 2 datasets at a lower computational cost, confirming its superiority. Code is available at: https://github.com/BoydeLi/Vcamba.
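To make the idea of unfolding a spectrum into an ordered sequence more concrete, the following is a minimal NumPy sketch. It is not the FSS strategy from the Vcamba code base: the abstract does not state the scan order, so the sketch simply assumes a low-to-high radial frequency ordering. The function names (`radial_scan_order`, `unfold_spectrum`, `fold_spectrum`, `partial_reconstruction`) are hypothetical and chosen for illustration only. The partial reconstruction shows how image structure builds up as components are superposed in order.

```python
import numpy as np


def radial_scan_order(height, width):
    """Flat indices of a centered (fftshift-ed) spectrum, sorted from low to
    high radial frequency. ASSUMPTION: radial ordering is an illustrative
    choice, not necessarily the scan order used by Vcamba."""
    fy = np.fft.fftshift(np.fft.fftfreq(height))
    fx = np.fft.fftshift(np.fft.fftfreq(width))
    radius = np.hypot(fy[:, None], fx[None, :])
    # A stable sort keeps the ordering deterministic for equal radii.
    return np.argsort(radius.ravel(), kind="stable")


def unfold_spectrum(frame):
    """Unfold the 2D spectrum of a grayscale frame into a 1D sequence."""
    spectrum = np.fft.fftshift(np.fft.fft2(frame))
    order = radial_scan_order(*frame.shape)
    return spectrum.ravel()[order], order


def fold_spectrum(sequence, order, shape):
    """Inverse of unfold_spectrum: scatter the sequence back and invert the FFT."""
    flat = np.zeros(sequence.shape, dtype=complex)
    flat[order] = sequence
    spectrum = flat.reshape(shape)
    # Taking the real part discards the imaginary residue that appears when
    # a truncated spectrum is no longer conjugate-symmetric.
    return np.fft.ifft2(np.fft.ifftshift(spectrum)).real


def partial_reconstruction(sequence, order, shape, keep):
    """Reconstruct the frame from only the first `keep` scanned components."""
    truncated = np.zeros_like(sequence)
    truncated[:keep] = sequence[:keep]
    return fold_spectrum(truncated, order, shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.random((8, 8))  # stand-in for one grayscale video frame
    sequence, order = unfold_spectrum(frame)
    for keep in (1, 16, 32, 64):
        approx = partial_reconstruction(sequence, order, frame.shape, keep)
        print(keep, float(np.linalg.norm(frame - approx)))
```

In a sequence-model setting, a 1D sequence like the one returned by `unfold_spectrum` (for example, split into real and imaginary channels) is the kind of input a causal scan could consume; how the AFE module actually tokenizes the spectrum is not specified in the abstract.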
