Skeleton-based action recognition has garnered significant attention because skeletons are concise and robust representations. Nevertheless, the absence of detailed body information in skeletons limits performance, while multimodal methods that rely on multimodal data in both the training and inference stages demand substantial inference resources and are inefficient. To address this and fully harness complementary multimodal features, we propose a novel multi-modality co-learning (MMCL) framework that leverages multimodal large language models (LLMs) as auxiliary networks for efficient skeleton-based action recognition. The framework engages in multi-modality co-learning during training and remains efficient by employing only concise skeletons at inference. Our MMCL framework primarily consists of two modules. First, the Feature Alignment Module (FAM) extracts rich RGB features from video frames and aligns them with global skeleton features via contrastive learning. Second, the Feature Refinement Module (FRM) uses RGB images with temporal information and text instructions to generate instructive features, building on the powerful generalization ability of multimodal LLMs. These instructive text features further refine the classification scores, and the refined scores enhance the model's robustness and generalization in a manner similar to soft labels. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA benchmarks consistently verify the effectiveness of our MMCL, which outperforms existing skeleton-based action recognition methods. Meanwhile, experiments on the UTD-MHAD and SYSU-Action datasets demonstrate the commendable generalization of our MMCL in zero-shot and domain-adaptive action recognition. Our code is publicly available at: https://github.com/liujf69/MMCL-Action.
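The contrastive alignment in the FAM can be illustrated with a minimal sketch. The snippet below is not the paper's implementation; it assumes a standard symmetric InfoNCE objective over batch-matched RGB and skeleton embeddings, with a temperature of 0.07 chosen arbitrarily for illustration.

```python
import numpy as np

def info_nce_loss(rgb_feats, skel_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning RGB and skeleton embeddings.

    rgb_feats, skel_feats: (batch, dim) arrays; row i of each is a positive pair.
    Illustrative only; the actual FAM objective may differ.
    """
    # L2-normalize so dot products become cosine similarities
    rgb = rgb_feats / np.linalg.norm(rgb_feats, axis=1, keepdims=True)
    skel = skel_feats / np.linalg.norm(skel_feats, axis=1, keepdims=True)
    logits = rgb @ skel.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy_diagonal(lg):
        # positives lie on the diagonal; use log-sum-exp for stability
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        idx = np.arange(lg.shape[0])
        return -logp[idx, idx].mean()

    # average the RGB->skeleton and skeleton->RGB directions
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(logits.T))
```

With matched pairs the loss is small, and it grows when the pairing between the two modalities is scrambled, which is what drives the two feature spaces into alignment.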
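The soft-label-style score refinement in the FRM can likewise be sketched. This is an assumed formulation, not the paper's exact rule: the skeleton classifier's probabilities are blended with probabilities derived from the LLM's instructive text features, with the mixing weight `alpha` a hypothetical hyperparameter.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def refine_scores(skeleton_logits, text_scores, alpha=0.6):
    """Blend skeleton classification scores with LLM-derived text scores.

    skeleton_logits: (batch, num_classes) logits from the skeleton backbone.
    text_scores:     (batch, num_classes) scores derived from instructive
                     text features (stand-in for the FRM output).
    alpha:           assumed mixing weight, chosen for illustration.
    """
    # acting like soft labels: the text scores smooth the skeleton prediction
    return alpha * softmax(skeleton_logits) + (1 - alpha) * softmax(text_scores)
```

Because both terms are valid probability distributions, the refined scores remain a distribution; the text-derived term softens overconfident skeleton predictions, much like label smoothing with an informed prior.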