MamBOA: State-Space Architecture for Video Recognition

Fine-grained action recognition demands temporal reasoning, which general-purpose architectures provide at different cost-accuracy tradeoffs. 3D dense operators couple computation to the input volume, while difference-based methods approximate motion through rigid, hand-crafted subtraction of uncontextualized features. Each reflects a deliberate design choice with corresponding limits on expressiveness or flexibility. We present MamBOA, a backbone-agnostic temporal framework built on a novel interleaved scan that recasts the selective state-space recurrence (S6) as a native motion synthesizer. By interleaving consecutive feature representations extracted from a pretrained backbone into a single alternating sequence, the scan structurally drives the recurrence to encode both temporal observations of each position within a shared hidden state, separated by only a single decay step. The inter-frame transition thus becomes an intrinsic component of the state dynamics rather than an externally computed quantity. A cascade of dedicated alignment and decoding operations then distills this joint encoding into an explicit motion representation, which a dual-path pooling mechanism adaptively aggregates by balancing attention-driven selection with uniform temporal coverage. The framework interfaces with CNN, Transformer, and Mamba backbone families, adding only about 2.1 GFLOPs per feature pair. On Diving48, MamBOA achieves 85.02% Top-1 accuracy with an image-pretrained backbone and 86.24% with a video-pretrained backbone while processing the entire video in a single forward pass, demonstrating that structurally induced state-space dynamics constitute a principled and general foundation for motion modeling.
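The interleaved scan described above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the MamBOA implementation: the function names (`interleave`, `linear_scan`, `pair_states`) are hypothetical, features are scalars rather than channel vectors, and the input-dependent S6 parameters are replaced by a fixed scalar decay so the arithmetic stays easy to follow.

```python
def interleave(frame_t, frame_t1):
    """Alternate per-position features: [x_t[0], x_t1[0], x_t[1], x_t1[1], ...]."""
    if len(frame_t) != len(frame_t1):
        raise ValueError("both frames must have the same number of positions")
    sequence = []
    for x_t, x_t1 in zip(frame_t, frame_t1):
        sequence.extend([x_t, x_t1])
    return sequence


def linear_scan(sequence, decay=0.5):
    """Toy recurrence h_k = decay * h_{k-1} + x_k; returns every hidden state.

    Assumption: S6 makes the decay and input gain input-dependent; a fixed
    scalar decay is used here only to keep the example readable.
    """
    h = 0.0
    states = []
    for x in sequence:
        h = decay * h + x
        states.append(h)
    return states


def pair_states(states):
    """Keep the state right after each position's second (frame t+1) token."""
    return states[1::2]


# Two spatial positions, one scalar feature each (hypothetical values).
frame_t = [1.0, 2.0]
frame_t1 = [10.0, 20.0]

seq = interleave(frame_t, frame_t1)
states = linear_scan(seq, decay=0.5)
joint = pair_states(states)

print(seq)    # [1.0, 10.0, 2.0, 20.0]
print(joint)  # [10.5, 23.625]
```

In each retained state, the frame t+1 feature of a position enters with weight 1 and the frame t feature of the same position with weight `decay`, so the two observations sit in one hidden state exactly one decay step apart. How that joint state is aligned and decoded into a motion representation is not specified in the abstract and is not modeled here.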
