In 3D action recognition, the skeleton modalities carry rich complementary information. How to model and exploit this information nevertheless remains a challenging problem for self-supervised 3D action representation learning. In this work, we formulate cross-modal interaction as a bidirectional knowledge distillation problem. Unlike classic distillation, which transfers the knowledge of a fixed, pre-trained teacher to a student, our formulation continuously updates the knowledge and distills it bidirectionally between modalities. To this end, we propose a new Cross-modal Mutual Distillation (CMD) framework with the following designs. First, the neighboring similarity distribution is introduced to model the knowledge learned in each modality; this relational information is naturally suited to contrastive frameworks. Second, asymmetric configurations are used for the teacher and the student to stabilize the distillation process and to transfer high-confidence information between modalities. By derivation, we find that the cross-modal positive mining used in previous works can be regarded as a degenerate case of CMD. We perform extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets. Our approach outperforms existing self-supervised methods and sets a series of new records. The code is available at: https://github.com/maoyunyao/CMD
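As an illustration only, the two designs named above can be sketched in NumPy under several assumptions that the abstract does not state. The sketch assumes that the neighboring similarity distribution is a softmax over cosine similarities to a bank of anchor embeddings, that the asymmetry is a lower (sharper) temperature on the teacher side, and that the distillation loss is a KL divergence applied in both directions. All function names, parameter names, and temperature values here are hypothetical and are not taken from the released code.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def neighbor_distribution(emb, anchors, temperature):
    # Similarity distribution of each embedding over a set of anchor embeddings.
    sims = l2_normalize(emb) @ l2_normalize(anchors).T
    return softmax(sims / temperature, axis=-1)

def kl_divergence(p, q, eps=1e-12):
    # Mean KL(p || q) over the batch.
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def mutual_distillation_loss(emb_a, bank_a, emb_b, bank_b,
                             tau_teacher=0.05, tau_student=0.1):
    # Assumption: row i of bank_a and bank_b is the same sample seen in two
    # modalities, so both distributions range over the same set of neighbors.
    # Assumption: the teacher uses the lower temperature (hypothetical values).
    teacher_a = neighbor_distribution(emb_a, bank_a, tau_teacher)
    teacher_b = neighbor_distribution(emb_b, bank_b, tau_teacher)
    student_a = neighbor_distribution(emb_a, bank_a, tau_student)
    student_b = neighbor_distribution(emb_b, bank_b, tau_student)
    # Bidirectional: modality A teaches B, and modality B teaches A.
    return kl_divergence(teacher_a, student_b) + kl_divergence(teacher_b, student_a)

# Toy usage with random embeddings; the two modalities may differ in dimension.
rng = np.random.default_rng(0)
emb_a, bank_a = rng.normal(size=(4, 8)), rng.normal(size=(16, 8))
emb_b, bank_b = rng.normal(size=(4, 6)), rng.normal(size=(16, 6))
loss = mutual_distillation_loss(emb_a, bank_a, emb_b, bank_b)
```

In an actual training setup the teacher distributions would be treated as constants (no gradient), which plain NumPy does not model. The sketch only shows how a relational, neighbor-based notion of knowledge lets two modalities with different embedding spaces be compared directly.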