Self-supervised representation learning for human action recognition has developed rapidly in recent years. Most existing works are based on skeleton data and adopt a multi-modality setup. However, these works overlook the performance differences among modalities, which leads to the propagation of erroneous knowledge between them. They also use only three fundamental modalities, i.e., joints, bones, and motions, leaving additional modalities unexplored. In this work, we first propose an Implicit Knowledge Exchange Module (IKEM), which alleviates the propagation of erroneous knowledge between low-performance modalities. We then propose three new modalities to enrich the complementary information between modalities. Finally, to maintain efficiency when introducing new modalities, we propose a novel teacher-student framework, named relational cross-modality knowledge distillation, which distills knowledge from the secondary modalities into the mandatory modalities while considering the relationships constrained by anchors, positives, and negatives. Experimental results demonstrate the effectiveness of our approach, unlocking the efficient use of skeleton-based multi-modality data. Source code will be made publicly available at https://github.com/desehuileng0o0/IKEM.
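For readers unfamiliar with the three fundamental modalities named above, the following is a minimal sketch of how bone and motion streams are commonly derived from joint coordinates in skeleton-based action recognition. It is not taken from the IKEM code base. The function names `bone_stream` and `motion_stream`, the toy three-joint skeleton, and the `parents` list are all hypothetical and chosen only for illustration. The sketch assumes that a bone is the vector from a parent joint to its child and that motion is the frame-to-frame displacement of each joint, with the last frame zero-padded.

```python
from typing import List, Sequence, Tuple

Joint = Tuple[float, ...]   # (x, y, z) coordinate of one joint
Frame = List[Joint]         # all joints of one skeleton at one time step


def bone_stream(frames: Sequence[Frame], parents: Sequence[int]) -> List[Frame]:
    """Bone modality: each joint minus its parent joint.

    Assumption: the root joint lists itself as its parent, so its bone
    vector is all zeros.
    """
    return [
        [
            tuple(c - p for c, p in zip(frame[i], frame[parents[i]]))
            for i in range(len(frame))
        ]
        for frame in frames
    ]


def motion_stream(frames: Sequence[Frame]) -> List[Frame]:
    """Motion modality: displacement of each joint between consecutive frames.

    Assumption: the last frame has no successor, so it is padded with zeros
    to keep the sequence length unchanged.
    """
    if not frames:
        return []
    motion = [
        [tuple(b - a for a, b in zip(ja, jb)) for ja, jb in zip(cur, nxt)]
        for cur, nxt in zip(frames, frames[1:])
    ]
    motion.append([tuple(0.0 for _ in joint) for joint in frames[-1]])
    return motion


# Toy skeleton (hypothetical): a chain of three joints over two frames.
parents = [0, 0, 1]  # joint 0 is the root; joint 1 hangs off 0; joint 2 off 1
joints = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)],
    [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 3.0, 0.0)],
]
bones = bone_stream(joints, parents)
motions = motion_stream(joints)
```

Each derived stream keeps the shape of the joint stream (frames by joints by coordinates), so in a multi-modality setup all three can be passed to encoders of identical architecture.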