Video action recognition (VAR) plays a crucial role in domains such as surveillance, healthcare, and industrial automation, making it highly significant to society. Consequently, it has long been a research hotspot in computer vision. With the flourishing of artificial neural networks (ANNs), convolutional neural networks (CNNs), including 2D-CNNs and 3D-CNNs, as well as variants of the vision transformer (ViT), have shown impressive performance on VAR. However, they usually incur enormous computational costs due to the large data volume and heavy information redundancy introduced by the temporal dimension. To address this challenge, some researchers have turned to brain-inspired spiking neural networks (SNNs), such as recurrent SNNs and ANN-converted SNNs, leveraging their inherent temporal dynamics and energy efficiency. Yet current SNNs for VAR also face limitations, such as nontrivial input preprocessing, intricate network construction and training, and repetitive processing of the same video clip, all of which hinder practical deployment. In this study, we propose SVFormer (Spiking Video transFormer), a directly trained SNN for VAR. SVFormer integrates local feature extraction, global self-attention, and the intrinsic dynamics, sparsity, and spike-driven nature of SNNs to extract spatio-temporal features efficiently and effectively. We evaluate SVFormer on two RGB datasets (UCF101, NTU-RGBD60) and one neuromorphic dataset (DVS128-Gesture), demonstrating performance comparable to mainstream models at substantially lower cost. Notably, SVFormer achieves a top-1 accuracy of 84.03% with ultra-low power consumption (21 mJ/video) on UCF101, which is state-of-the-art among directly trained deep SNNs and a significant advance over prior models.
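To make the core idea concrete, below is a minimal, self-contained PyTorch sketch of a spike-driven self-attention block in the spirit of the abstract's description: LIF neurons emit binary spikes, and the attention products operate on sparse {0,1} activations over explicit time steps. The class names, the sigmoid surrogate gradient, the hard-reset rule, and the softmax-free scaling are all illustrative assumptions for exposition, not the authors' exact SVFormer design.

```python
import torch
import torch.nn as nn


class LIF(nn.Module):
    """Leaky integrate-and-fire neuron with a sigmoid surrogate gradient (assumed)."""

    def __init__(self, tau: float = 2.0, v_th: float = 1.0):
        super().__init__()
        self.tau, self.v_th = tau, v_th

    def forward(self, x):  # x: (T, B, N, D) input current over T time steps
        v = torch.zeros_like(x[0])
        spikes = []
        for t in range(x.shape[0]):
            v = v + (x[t] - v) / self.tau                  # leaky integration
            s_hard = (v >= self.v_th).float()              # fire if above threshold
            surrogate = torch.sigmoid(4.0 * (v - self.v_th))
            s = s_hard + (surrogate - surrogate.detach())  # straight-through gradient
            v = v * (1.0 - s_hard)                         # hard reset after a spike
            spikes.append(s)
        return torch.stack(spikes)                         # binary spike train


class SpikingSelfAttention(nn.Module):
    """Self-attention where Q, K, V are spike trains; no softmax, spike-driven products."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.h, self.d = heads, dim // heads
        self.q, self.k, self.v = (nn.Linear(dim, dim, bias=False) for _ in range(3))
        self.lif_q, self.lif_k, self.lif_v, self.lif_o = LIF(), LIF(), LIF(), LIF()
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):  # x: (T, B, N, D) token spikes
        T, B, N, D = x.shape

        def split(z):  # -> (T, B, heads, N, d)
            return z.view(T, B, N, self.h, self.d).transpose(2, 3)

        q = split(self.lif_q(self.q(x)))
        k = split(self.lif_k(self.k(x)))
        v = split(self.lif_v(self.v(x)))
        attn = (q @ k.transpose(-2, -1)) * self.d ** -0.5  # sparse binary products
        out = (attn @ v).transpose(2, 3).reshape(T, B, N, D)
        return self.lif_o(self.proj(out))


# Toy usage: 4 time steps, batch 2, 16 tokens, embedding dim 64
x = (torch.rand(4, 2, 16, 64) > 0.8).float()  # random input spike train
y = SpikingSelfAttention(64)(x)
print(y.shape)  # torch.Size([4, 2, 16, 64])
```

Because Q, K, and V are binary spike tensors, the matrix multiplications reduce to sparse accumulations, which is the basis of the energy-efficiency argument for spike-driven attention; the explicit loop over T reflects the temporal dynamics the abstract attributes to SNNs.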