Self-supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences

Self-supervised learning has demonstrated remarkable capability in representation learning for skeleton-based action recognition. Existing methods mainly apply global data augmentation to generate different views of a skeleton sequence for contrastive learning. However, although skeleton sequences contain rich action clues, these methods may take only a global perspective, learning to discriminate between different skeletons without fully exploiting the local relationships between skeleton joints and video frames, which are essential for real-world applications. In this work, we propose a Partial Spatio-Temporal Learning (PSTL) framework that exploits these local relationships using partial skeleton sequences built by a unique spatio-temporal masking strategy. Specifically, we construct a negative-sample-free triplet stream structure composed of an anchor stream without any masking, a spatial masking stream with Central Spatial Masking (CSM), and a temporal masking stream with Motion Attention Temporal Masking (MATM). A feature cross-correlation matrix is computed between the anchor stream and each of the two masking streams. (1) Central Spatial Masking discards selected joints from the feature calculation process, where joints with a higher degree of centrality are more likely to be selected. (2) Motion Attention Temporal Masking uses the motion of the action and removes faster-moving frames with higher probability. Our method achieves state-of-the-art performance on NTURGB+D 60, NTURGB+D 120 and PKU-MMD under various downstream tasks. Furthermore, we perform a practical evaluation in which some skeleton joints are missing in downstream tasks. In contrast to previous methods, which suffer large performance drops, PSTL still achieves remarkable results under this challenging setting, validating the robustness of our method.
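
The two masking strategies and the cross-correlation objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made for illustration: a skeleton sequence is stored as an array of shape (frames, joints, channels); masked joints are simply zeroed out rather than removed from the feature calculation; frame motion is taken as the absolute difference between consecutive frames and turned into sampling probabilities with a softmax; and the loss follows a Barlow Twins-style form, since the abstract only states that a feature cross-correlation matrix is measured. All function names, the toy edge list, and the `off_diag_weight` parameter are hypothetical.

```python
import numpy as np


def central_spatial_mask(seq, edges, num_masked, rng):
    """Zero out joints sampled with probability proportional to their degree.

    seq: array of shape (T, V, C); edges: list of (joint_a, joint_b) bones.
    Assumption: every joint appears in at least one edge.
    """
    num_joints = seq.shape[1]
    degree = np.zeros(num_joints)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    prob = degree / degree.sum()  # more central joints are more likely picked
    masked = rng.choice(num_joints, size=num_masked, replace=False, p=prob)
    out = seq.copy()
    out[:, masked, :] = 0.0  # simplification: zeroing instead of discarding
    return out, np.sort(masked)


def motion_attention_temporal_mask(seq, num_masked, rng):
    """Remove frames sampled with higher probability where motion is larger."""
    num_frames = seq.shape[0]
    # Per-frame motion: summed absolute difference to the next frame.
    diff = np.abs(np.diff(seq, axis=0)).sum(axis=(1, 2))  # shape (T - 1,)
    motion = np.append(diff, diff[-1])  # repeat last value for the final frame
    weights = np.exp(motion - motion.max())  # numerically stable softmax
    prob = weights / weights.sum()
    masked = rng.choice(num_frames, size=num_masked, replace=False, p=prob)
    keep = np.setdiff1d(np.arange(num_frames), masked)
    return seq[keep], np.sort(masked)


def cross_correlation_loss(z_a, z_b, off_diag_weight=5e-3):
    """Barlow Twins-style loss on the cross-correlation of two feature batches.

    z_a, z_b: arrays of shape (N, D). No negative samples are needed.
    """
    n = z_a.shape[0]
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + 1e-8)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + 1e-8)
    c = z_a.T @ z_b / n  # (D, D) cross-correlation matrix
    diag = np.diag(c)
    on_diag = ((1.0 - diag) ** 2).sum()  # pull matching features together
    off_diag = (c ** 2).sum() - (diag ** 2).sum()  # decorrelate the rest
    return on_diag + off_diag_weight * off_diag


# Toy example: a 5-joint skeleton whose joint 1 is the most central.
toy_edges = [(0, 1), (1, 2), (1, 3), (1, 4)]
rng = np.random.default_rng(0)
anchor = rng.standard_normal((16, 5, 3))
spatial_view, masked_joints = central_spatial_mask(anchor, toy_edges, 2, rng)
temporal_view, masked_frames = motion_attention_temporal_mask(anchor, 4, rng)
```

In a triplet stream setup of the kind described, the anchor and each masked view would pass through a shared encoder, and a loss such as `cross_correlation_loss` would be evaluated once for the anchor and spatial pair and once for the anchor and temporal pair.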
