In few-shot action recognition (FSAR), long sub-sequences of video naturally express entire actions more effectively. However, the high computational complexity of mainstream Transformer-based methods limits their application. The recently proposed Mamba demonstrates efficiency in modeling long sequences, but directly applying Mamba to FSAR overlooks the importance of local feature modeling and alignment. Moreover, long sub-sequences within the same class accumulate intra-class variance, which adversely impacts FSAR performance. To address these challenges, we propose a Matryoshka MAmba and CoNtrasTive LeArning framework (Manta). First, the Matryoshka Mamba introduces multiple Inner Modules to enhance local feature representation, rather than modeling global features directly; an Outer Module then captures temporal dependencies between these local features for implicit temporal alignment. Second, a hybrid contrastive learning paradigm, combining supervised and unsupervised methods, is designed to mitigate the negative effects of intra-class variance accumulation. The Matryoshka Mamba and the hybrid contrastive learning paradigm operate in two parallel branches within Manta, enhancing Mamba for FSAR on long sub-sequences. Manta achieves new state-of-the-art performance on prominent benchmarks, including SSv2, Kinetics, UCF101, and HMDB51. Extensive empirical studies demonstrate that Manta significantly improves long sub-sequence FSAR from multiple perspectives.