Human action recognition (HAR) plays a key role in applications such as video analysis, surveillance, autonomous driving, robotics, and healthcare. Most HAR algorithms are developed from RGB images, which capture detailed visual information. However, these algorithms raise concerns in privacy-sensitive environments because they record identifiable features. Event cameras offer a promising alternative: they sparsely capture per-pixel brightness changes without recording full images. Moreover, their high dynamic range allows them to handle complex lighting conditions, such as low-light or high-contrast environments. However, event cameras introduce the challenge of modeling spatially sparse, high-temporal-resolution event data for HAR. To address these issues, we propose the SpikMamba framework, which combines the energy efficiency of spiking neural networks with the long-sequence modeling capability of Mamba to efficiently capture global features from spatially sparse, high-temporal-resolution event data. Additionally, to improve the locality of modeling, a spiking window-based linear attention mechanism is used. Extensive experiments show that SpikMamba achieves remarkable recognition performance, surpassing the previous state of the art by 1.45%, 7.22%, 0.15%, and 3.92% on the PAF, HARDVS, DVS128, and E-FAction datasets, respectively. The code is available at https://github.com/Typistchen/SpikMamba.
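To make the "spiking window-based linear attention" idea concrete, the following is a minimal NumPy sketch, not the authors' implementation: it assumes binary (Heaviside) spike activations on queries and keys, non-overlapping temporal windows, and the standard linear-attention factorization. All function names, shapes, and thresholds here are illustrative assumptions.

```python
import numpy as np

def spike(x, threshold=0.0):
    # Assumed binary spike activation (Heaviside step); the paper's
    # actual spiking neuron model may differ.
    return (x > threshold).astype(np.float32)

def spiking_window_linear_attention(x, Wq, Wk, Wv, window=4):
    # x: (T, d) features along the time axis of the event stream.
    # Attention is restricted to non-overlapping windows to add locality,
    # and uses the linear-attention trick (K^T V first) for O(n * d^2) cost.
    T, d = x.shape
    out = np.zeros_like(x)
    for start in range(0, T, window):
        seg = x[start:start + window]
        q = spike(seg @ Wq)          # binary spiking queries
        k = spike(seg @ Wk)          # binary spiking keys
        v = seg @ Wv                 # real-valued values
        kv = k.T @ v                 # (d, d) summary of the window
        z = k.sum(axis=0)            # normalizer, shape (d,)
        out[start:start + window] = (q @ kv) / (q @ z[:, None] + 1e-6)
    return out

# Illustrative usage with random features and identity projections.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16)).astype(np.float32)
W = np.eye(16, dtype=np.float32)
y = spiking_window_linear_attention(x, W, W, W, window=4)
```

Because the spiking queries and keys are binary, the `q @ kv` product reduces to sparse accumulation, which is the source of the energy-efficiency argument for SNN-based attention.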