Recent progress in self-supervised 3D human action representation learning is largely attributed to contrastive learning. However, in conventional contrastive frameworks, the rich complementarity between different skeleton modalities remains under-explored. Moreover, because models are optimized to distinguish self-augmented samples, they struggle with the numerous similar positive instances that arise when the number of action categories is limited. In this work, we tackle these problems by introducing a general Inter- and Intra-modal Mutual Distillation (I$^2$MD) framework. In I$^2$MD, we first reformulate cross-modal interaction as a Cross-modal Mutual Distillation (CMD) process. Unlike existing distillation solutions that transfer the knowledge of a pre-trained, fixed teacher to a student, CMD continuously updates the knowledge and distills it bidirectionally between modalities during pre-training. To alleviate the interference of similar samples and exploit their underlying contexts, we further design the Intra-modal Mutual Distillation (IMD) strategy. In IMD, the Dynamic Neighbors Aggregation (DNA) mechanism is first introduced, where an additional cluster-level discrimination branch is instantiated in each modality. DNA adaptively aggregates highly correlated neighboring features, forming local cluster-level contrasting. Mutual distillation is then performed between the two branches for cross-level knowledge exchange. Extensive experiments on three datasets show that our approach sets a series of new records.
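The abstract does not state the loss functions, so the following is only an illustrative sketch of how bidirectional distillation between two modalities and neighbor aggregation might look. It assumes that each modality describes a sample by its similarity distribution over a shared set of anchor embeddings, that knowledge is exchanged with a KL divergence in both directions, and that neighbors are aggregated by averaging the top-k most similar anchors. All function names, the two temperatures, and the top-k rule are hypothetical choices made for illustration, not details taken from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def softmax(scores, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_distribution(query, anchors, temperature):
    """Distribution of cosine similarities between a query and the anchors."""
    q = normalize(query)
    return softmax([dot(q, normalize(a)) for a in anchors], temperature)


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q is assumed strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_distillation_loss(query_a, anchors_a, query_b, anchors_b,
                             t_teacher=0.05, t_student=0.1):
    """Assumed bidirectional loss: each modality acts as teacher for the other.

    The sharper teacher temperature is an assumption of this sketch.
    """
    teacher_a = similarity_distribution(query_a, anchors_a, t_teacher)
    teacher_b = similarity_distribution(query_b, anchors_b, t_teacher)
    student_a = similarity_distribution(query_a, anchors_a, t_student)
    student_b = similarity_distribution(query_b, anchors_b, t_student)
    return kl_divergence(teacher_a, student_b) + kl_divergence(teacher_b, student_a)


def aggregate_neighbors(query, anchors, k):
    """Assumed stand-in for neighbor aggregation: mean of the query and its
    k most similar anchors, renormalized to unit length."""
    q = normalize(query)
    unit_anchors = [normalize(a) for a in anchors]
    ranked = sorted(unit_anchors, key=lambda a: dot(q, a), reverse=True)
    group = [q] + ranked[:k]
    mean = [sum(column) / len(group) for column in zip(*group)]
    return normalize(mean)


if __name__ == "__main__":
    # Toy 2-D embeddings for two hypothetical modalities (e.g. joint and motion).
    joint_anchors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    motion_anchors = [[0.8, 0.2], [0.1, 1.0], [1.0, 0.1]]
    loss = mutual_distillation_loss([1.0, 0.2], joint_anchors,
                                    [0.7, 0.4], motion_anchors)
    print("mutual distillation loss:", loss)
    print("cluster-level feature:", aggregate_neighbors([1.0, 0.2], joint_anchors, k=2))
```

In this sketch the loss is symmetric in the two modalities, and the gradient would flow to both encoders, which is one way to read "continuously updated and bidirectionally distilled". A real implementation would operate on batched tensors and a memory bank rather than Python lists.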