MK-SGN: A Spiking Graph Convolutional Network with Multimodal Fusion and Knowledge Distillation for Skeleton-based Action Recognition

In recent years, skeleton-based action recognition built on multimodal Graph Convolutional Networks (GCNs) has achieved remarkable results. However, because of their deep structure and reliance on continuous floating-point operations, GCN-based methods are energy-intensive. To address this issue, we propose a Spiking Graph Convolutional Network with Multimodal Fusion and Knowledge Distillation (MK-SGN). By combining the energy efficiency of Spiking Neural Networks (SNNs) with the graph representation capability of GCNs, MK-SGN reduces energy consumption while maintaining recognition accuracy. First, we convert GCNs into Spiking Graph Convolutional Networks (SGNs), establishing a new benchmark and paving the way for future research. In the process, we introduce a spiking attention mechanism and design a Spiking-Spatio Graph Convolution module with a Spatial Global Spiking Attention mechanism (SA-SGC), which enhances feature learning. Second, we propose a Spiking Multimodal Fusion (SMF) module that leverages mutual information to process multimodal data more efficiently. Finally, we investigate knowledge distillation from multimodal GCNs to SGNs and propose a novel integrated method that combines intermediate-layer distillation with soft-label distillation to improve SGN performance. On three challenging skeleton-based action recognition datasets, MK-SGN consumes less energy than state-of-the-art GCN-like frameworks and achieves higher accuracy than state-of-the-art SNN frameworks. Specifically, with 4 time steps on the NTU-RGB+D 60 cross-subject split, our method reduces energy consumption by more than 98% compared with typical GCN-based methods while maintaining competitive accuracy.
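The abstract does not give the exact form of the integrated distillation objective, so the following is only a minimal sketch of how an intermediate-layer term and a soft-label term might be combined. Everything specific here is an assumption made for illustration: the function names, the temperature, the weights `alpha` and `beta`, the use of KL divergence and mean squared error, and the choice to compare time-averaged spike rates of the student with the teacher's features.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_label_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened class distributions."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    # The T^2 factor is the usual rescaling for softened targets.
    return float(kl.mean() * temperature ** 2)

def feature_loss(student_spikes, teacher_features):
    """MSE between the student's firing rates and the teacher's features.

    student_spikes:   (T, N, D) binary spikes over T time steps
    teacher_features: (N, D) real-valued intermediate features
    Assumption: averaging over time steps is one simple way to make
    binary spike trains comparable to continuous activations.
    """
    firing_rate = student_spikes.mean(axis=0)
    return float(np.mean((firing_rate - teacher_features) ** 2))

def distillation_loss(student_logits, teacher_logits, student_spikes,
                      teacher_features, labels,
                      alpha=0.5, beta=0.5, temperature=4.0):
    """Cross-entropy plus weighted soft-label and intermediate-layer terms."""
    p_s = softmax(student_logits)
    ce = -np.log(p_s[np.arange(len(labels)), labels] + 1e-12).mean()
    soft = soft_label_loss(student_logits, teacher_logits, temperature)
    feat = feature_loss(student_spikes, teacher_features)
    return float(ce + alpha * soft + beta * feat)
```

In this sketch both distillation terms vanish when the student reproduces the teacher's logits and features exactly, leaving only the ordinary cross-entropy on the ground-truth labels.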
