Owing to their instance-level discriminative ability, contrastive learning methods such as MoCo and SimCLR have been adapted from image representation learning to self-supervised skeleton-based action recognition. These methods usually use multiple data streams (i.e., joint, motion, and bone) for ensemble learning. However, how to construct a discriminative feature space within a single stream and how to effectively aggregate information from multiple streams remain open problems. To this end, this paper first applies BYOL, a recent contrastive learning method, to skeleton data and formulates SkeletonBYOL as a simple yet effective baseline for self-supervised skeleton-based action recognition. Building on SkeletonBYOL, this paper further presents a Cross-Model and Cross-Stream (CMCS) framework, which combines Cross-Model Adversarial Learning (CMAL) and Cross-Stream Collaborative Learning (CSCL). Specifically, CMAL learns single-stream representations with a cross-model adversarial loss to obtain more discriminative features. To aggregate multi-stream information and let the streams interact, CSCL generates similarity pseudo-labels from the ensemble of streams and uses them as supervision to guide feature generation in each individual stream. Extensive experiments on three datasets verify that CMAL and CSCL are complementary and that the proposed method outperforms state-of-the-art methods under various evaluation protocols.
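For readers unfamiliar with BYOL, the objective that SkeletonBYOL builds on can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the plain-list vector representation, and the momentum value `tau=0.99` are assumptions made for the example, and the encoders, projectors, and skeleton augmentations are omitted. It shows the standard BYOL regression loss between an online prediction and a target projection, plus the exponential-moving-average update of the target parameters.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def byol_loss(online_prediction, target_projection):
    """BYOL regression loss: 2 - 2 * cosine similarity of the two vectors.

    The value is 0 when the vectors point the same way and 4 when they are opposite.
    """
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))


def ema_update(target_params, online_params, tau=0.99):
    """Move target parameters toward the online ones; tau is a hypothetical momentum value."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]


if __name__ == "__main__":
    # Two hypothetical feature vectors for two augmented views of one skeleton sequence.
    print(byol_loss([1.0, 0.0], [2.0, 0.0]))
    print(ema_update([1.0, 1.0], [0.0, 2.0], tau=0.9))
```

In BYOL the loss is usually symmetrized by swapping the two views and summing both terms, and gradients flow only through the online branch; the target branch changes only through the moving-average update.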