Existing multimodal human action recognition approaches are computationally intensive, limiting their deployment in real-time applications. In this work, we present a novel and efficient pose-driven attention-guided multimodal network (EPAM-Net) for action recognition in videos. Specifically, we propose eXpand temporal Shift (X-ShiftNet) convolutional architectures for the RGB and pose streams to capture spatio-temporal features from RGB videos and their skeleton sequences. X-ShiftNet tackles the high computational cost of 3D CNNs by integrating the Temporal Shift Module (TSM) into an efficient 2D CNN, enabling efficient spatio-temporal learning. The skeleton features are then used to guide the visual stream, focusing it on key frames and their salient spatial regions via the proposed spatial-temporal attention block. Finally, the predictions of the two streams are fused for the final classification. Experimental results show that our method, with a significant reduction in floating-point operations (FLOPs), outperforms or is competitive with state-of-the-art methods on the NTU RGB-D 60, NTU RGB-D 120, PKU-MMD, and Toyota SmartHome datasets. The proposed EPAM-Net provides up to a 72.8x reduction in FLOPs and up to a 48.6x reduction in the number of network parameters. The code will be available at https://github.com/ahmed-nady/Multimodal-Action-Recognition.
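The core idea behind TSM, which X-ShiftNet builds on, is to shift a fraction of feature channels along the temporal axis so that a 2D convolution can mix information across neighboring frames at zero extra FLOPs. A minimal NumPy sketch of this zero-padded channel shift is shown below; it is an illustration of the general TSM operation, not the authors' implementation, and the `shift_div=8` fraction (1/8 of channels shifted in each direction) follows the common TSM default.

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    """TSM-style temporal shift on a clip of shape (T, C, H, W).

    The first C//shift_div channels are shifted forward in time,
    the next C//shift_div are shifted backward, and the remaining
    channels are left unchanged. Vacated positions are zero-padded.
    """
    t, c, h, w = x.shape
    fold = c // shift_div
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                   # shift forward in time
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]   # shift backward in time
    out[:, 2 * fold:] = x[:, 2 * fold:]              # keep the rest in place
    return out

# Toy example: 4 frames, 8 channels, 2x2 spatial resolution
clip = np.arange(4 * 8 * 2 * 2, dtype=np.float32).reshape(4, 8, 2, 2)
shifted = temporal_shift(clip)
print(shifted.shape)  # (4, 8, 2, 2): shape is unchanged, only channels move in time
```

In a full network, this shift would be inserted before the 2D convolution of each residual block, so each spatial convolution also sees features from adjacent frames.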