Skeleton-based action representation learning aims to interpret and understand human behaviors by encoding skeleton sequences, and it falls into two primary training paradigms: supervised learning and self-supervised learning. However, the former's one-hot classification requires labor-intensive annotation of predefined action categories, while the latter relies on pretext tasks involving skeleton transformations (e.g., cropping) that may impair the skeleton structure. To address these challenges, we introduce C$^2$VL, a novel skeleton-based training framework built on Cross-modal Contrastive learning that uses progressive distillation to learn task-agnostic human skeleton action representations from Vision-Language knowledge prompts. Specifically, we establish a vision-language action concept space through vision-language knowledge prompts generated by pre-trained large multimodal models (LMMs), which enriches the fine-grained details that the skeleton action space lacks. Moreover, we propose intra-modal self-similarity and inter-modal cross-consistency softened targets that progressively control and guide how closely the vision-language knowledge prompts and the corresponding skeletons are pulled together during cross-modal representation learning. These soft instance discrimination and self-knowledge distillation strategies contribute to learning better skeleton-based action representations from noisy skeleton-vision-language pairs. During inference, our method requires only skeleton data as input for action recognition; vision-language prompts are no longer needed. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our method outperforms previous methods and achieves state-of-the-art results. Code is available at: https://github.com/cseeyangchen/C2VL.
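The soft instance discrimination described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the mixing coefficient `alpha`, and the temperature `tau` are illustrative assumptions; it only shows the general idea of softening one-hot contrastive targets with an intra-modal self-similarity distribution.

```python
import torch
import torch.nn.functional as F

def soft_cross_modal_loss(skel_emb, vl_emb, tau=0.1, alpha=0.5):
    """Hypothetical sketch of soft instance discrimination.

    skel_emb: (N, D) skeleton embeddings
    vl_emb:   (N, D) paired vision-language prompt embeddings
    alpha:    weight on the hard one-hot target (could be scheduled
              to realize the paper's progressive softening)
    """
    s = F.normalize(skel_emb, dim=-1)
    v = F.normalize(vl_emb, dim=-1)

    # Inter-modal similarity logits between skeletons and prompts.
    logits = s @ v.t() / tau

    # Intra-modal self-similarity of the prompts, turned into a
    # distribution that softens the one-hot instance targets.
    intra = F.softmax((v @ v.t()) / tau, dim=-1)
    hard = torch.eye(s.size(0), device=s.device)
    targets = alpha * hard + (1 - alpha) * intra

    # Cross-entropy between soft targets and the predicted distribution.
    log_p = F.log_softmax(logits, dim=-1)
    return -(targets * log_p).sum(dim=-1).mean()
```

In such a scheme, a larger `alpha` behaves like standard contrastive instance discrimination, while a smaller `alpha` lets similar prompt pairs share probability mass, which is one plausible way to tolerate noisy skeleton-vision-language pairs.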