Action detection in real-world scenarios is particularly challenging due to densely distributed actions in hour-long untrimmed videos. It requires modeling both short- and long-term temporal relationships while handling significant intra-class temporal variations. Previous state-of-the-art (SOTA) Transformer-based architectures, though effective, are impractical for real-world deployment: their high parameter counts, GPU memory usage, and limited throughput make them unsuitable for very long videos. In this work, we adapt the Mamba architecture to action detection and propose Multi-scale Temporal Mamba (MS-Temba), comprising two key components: Temporal Mamba (Temba) Blocks and the Temporal Mamba Fuser. Each Temba Block combines a Temporal Local Module (TLM) for short-range temporal modeling with a Dilated Temporal SSM (DTS) for long-range dependencies. By introducing dilation, a concept novel to Mamba, the TLM and DTS capture local and global features at multiple temporal scales. The Temba Fuser then aggregates these scale-specific features with Mamba to learn a comprehensive multi-scale representation of untrimmed videos. MS-Temba is validated on three public datasets, outperforming SOTA methods on long videos and matching prior methods on short videos while using only one-eighth of the parameters.
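To make the dilation idea concrete, the following is a minimal, self-contained PyTorch sketch of one plausible reading of the abstract: a dilated depthwise convolution standing in for the TLM, and a sequence model run over interleaved dilated subsequences standing in for the DTS. Everything here is an assumption for illustration, not the paper's implementation: `TembaBlockSketch` is a hypothetical name, a GRU substitutes for the selective SSM purely to keep the sketch runnable without the `mamba_ssm` package, and the averaging fusion is a naive stand-in for the Temba Fuser.

```python
# Illustrative sketch only; module names, the GRU stand-in for the selective
# SSM, and the fusion step are assumptions, not the authors' code.
import torch
import torch.nn as nn

class TembaBlockSketch(nn.Module):
    """One hypothetical Temba block at a single temporal scale (dilation)."""
    def __init__(self, dim: int, dilation: int):
        super().__init__()
        self.dilation = dilation
        # TLM stand-in: dilated depthwise temporal conv for short-range features.
        self.tlm = nn.Conv1d(dim, dim, kernel_size=3, padding=dilation,
                             dilation=dilation, groups=dim)
        # DTS stand-in: a recurrent pass over dilated subsequences; a GRU
        # replaces the selective SSM so the sketch runs with plain PyTorch.
        self.dts = nn.GRU(dim, dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) clip-level features of an untrimmed video.
        local = self.tlm(x.transpose(1, 2)).transpose(1, 2)
        # Dilation for a sequence model: split the timeline into `dilation`
        # interleaved subsequences, model each independently, re-interleave.
        # Each recurrent step then spans `dilation` frames, widening the
        # effective temporal receptive field.
        b, t, d = x.shape
        pad = (-t) % self.dilation
        xp = nn.functional.pad(x, (0, 0, 0, pad))
        sub = (xp.view(b, -1, self.dilation, d)        # (b, t'/dil, dil, d)
                 .transpose(1, 2)                      # (b, dil, t'/dil, d)
                 .reshape(b * self.dilation, -1, d))   # one seq per phase
        out, _ = self.dts(sub)
        out = (out.view(b, self.dilation, -1, d)
                  .transpose(1, 2)
                  .reshape(b, -1, d)[:, :t])           # undo the interleave
        return self.norm(x + local + out)

# Multiple scales via increasing dilations; a mean over scales is a naive
# stand-in for the Temba Fuser's Mamba-based aggregation.
blocks = nn.ModuleList([TembaBlockSketch(dim=64, dilation=d) for d in (1, 2, 4)])
video = torch.randn(2, 500, 64)             # 2 videos, 500 frames, 64-dim features
scales = [blk(video) for blk in blocks]     # one representation per temporal scale
fused = torch.stack(scales).mean(0)
print(fused.shape)                          # torch.Size([2, 500, 64])
```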