In recent years, self-supervised action recognition from skeleton sequences has achieved remarkable results with contrastive learning. The semantic distinctions between human actions are often carried by local body parts, such as the legs or hands, and exploiting these local cues benefits skeleton-based action recognition. This paper proposes SkeAttnCLR, an attention-based contrastive learning framework for skeleton representation learning that integrates local similarity with global features. A multi-head attention mask module learns soft attention mask features from the skeletons, suppressing non-salient local features while accentuating salient ones, thereby bringing similar local features closer in the feature space. In addition, the set of contrastive pairs is expanded by pairing salient and non-salient features with global features, which guides the network to learn semantic representations of the entire skeleton. With the attention mask mechanism, SkeAttnCLR thus learns local features under different data augmentation views. Experimental results demonstrate that including local feature similarity significantly enhances skeleton-based action representation. SkeAttnCLR outperforms state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120, and PKU-MMD datasets.
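The mechanism described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the way the mask is derived from attention weights, the use of in-batch negatives, and the particular set of contrastive pairs are all assumptions made for illustration. It shows a soft mask splitting per-joint features into salient and non-salient parts, which are then contrasted across two augmented views together with the global feature.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_attention_mask(feats, w_q, w_k, num_heads):
    """feats: (N, J, C) per-joint features. Returns a soft mask (N, J) in (0, 1).

    Assumption: a joint's saliency is the attention it receives, averaged
    over heads and queries, then standardized and squashed by a sigmoid.
    """
    n, j, c = feats.shape
    d = c // num_heads
    q = (feats @ w_q).reshape(n, j, num_heads, d).transpose(0, 2, 1, 3)  # (N, H, J, d)
    k = (feats @ w_k).reshape(n, j, num_heads, d).transpose(0, 2, 1, 3)  # (N, H, J, d)
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(d), axis=-1)    # (N, H, J, J)
    received = attn.mean(axis=(1, 2))                                    # (N, J)
    z = received - received.mean(axis=1, keepdims=True)
    z = z / (received.std(axis=1, keepdims=True) + 1e-8)
    return 1.0 / (1.0 + np.exp(-z))

def pooled_views(feats, mask):
    """Return global, salient, and non-salient features, each of shape (N, C)."""
    m = mask[..., None]                                   # (N, J, 1)
    salient = (m * feats).sum(axis=1) / m.sum(axis=1)
    non_salient = ((1.0 - m) * feats).sum(axis=1) / (1.0 - m).sum(axis=1)
    return feats.mean(axis=1), salient, non_salient

def info_nce(query, key, temperature=0.1):
    """InfoNCE with in-batch negatives; positives lie on the diagonal."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    k = key / np.linalg.norm(key, axis=1, keepdims=True)
    logits = q @ k.T / temperature                        # (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def demo_loss(seed=0, n=8, joints=25, channels=32, heads=4):
    """Combine global and local contrastive terms for two synthetic views."""
    rng = np.random.default_rng(seed)
    view_a = rng.normal(size=(n, joints, channels))
    view_b = view_a + 0.1 * rng.normal(size=view_a.shape)  # stand-in for augmentation
    w_q = rng.normal(size=(channels, channels)) / np.sqrt(channels)
    w_k = rng.normal(size=(channels, channels)) / np.sqrt(channels)
    mask_a = soft_attention_mask(view_a, w_q, w_k, heads)
    mask_b = soft_attention_mask(view_b, w_q, w_k, heads)
    g_a, s_a, ns_a = pooled_views(view_a, mask_a)
    g_b, s_b, ns_b = pooled_views(view_b, mask_b)
    # Assumed pair set: global-global, salient-salient,
    # non-salient-non-salient, and salient-global.
    loss = (info_nce(g_a, g_b) + info_nce(s_a, s_b)
            + info_nce(ns_a, ns_b) + info_nce(s_a, g_b))
    return loss, mask_a
```

Calling `demo_loss()` returns the summed loss and the mask for the first view. In a real model the per-joint features would come from a trained skeleton encoder rather than random data, and negatives might be drawn from a memory bank instead of the batch.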