Skeleton-based action recognition is a central task in human-computer interaction. However, most previous methods suffer from two issues: (i) semantic ambiguity arising from the mixing of spatiotemporal information; and (ii) a failure to explicitly exploit the latent data distributions (i.e., the intra-class variations and inter-class relations), which leads skeleton encoders to locally optimal solutions. To mitigate these issues, we propose a spatial-temporal decoupling contrastive learning (STD-CL) framework that obtains discriminative and semantically distinct representations from skeleton sequences. STD-CL can be incorporated into almost all previous skeleton encoders and has no impact on them at test time. Specifically, we decouple the global features into spatial-specific and temporal-specific features to reduce the spatiotemporal coupling of features. Furthermore, to explicitly exploit the latent data distributions, we apply contrastive learning to the attentive features, modeling cross-sequence semantic relations by pulling together the features of positive pairs and pushing apart those of negative pairs. Extensive experiments show that STD-CL with four different skeleton encoders (HCN, 2S-AGCN, CTR-GCN, and Hyperformer) achieves solid improvements on the NTU60, NTU120, and NW-UCLA benchmarks. The code will be released.
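The decoupling and contrastive steps described above can be illustrated with a minimal NumPy sketch. It is not the released STD-CL code: the function names `decouple` and `supervised_contrastive_loss`, the `(N, C, T, V)` feature layout (batch, channels, frames, joints), the temperature value, and the use of plain average pooling in place of the attentive features are all assumptions made for illustration.

```python
import numpy as np


def decouple(features):
    """Split global features of shape (N, C, T, V) into two embeddings.

    Assumption: average pooling stands in for the attentive pooling
    mentioned in the abstract.
    """
    n = features.shape[0]
    # Spatial-specific: pool over the frame axis T, keep the joint axis V.
    spatial = features.mean(axis=2).reshape(n, -1)   # (N, C * V)
    # Temporal-specific: pool over the joint axis V, keep the frame axis T.
    temporal = features.mean(axis=3).reshape(n, -1)  # (N, C * T)
    return spatial, temporal


def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Label-based contrastive loss over a batch of embeddings z of shape (N, D).

    Samples sharing a label are positive pairs (pulled together);
    all other samples in the batch act as negatives (pushed apart).
    """
    labels = np.asarray(labels)
    # L2-normalise so the dot product is a cosine similarity.
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)
    sim = z @ z.T / temperature
    self_mask = np.eye(len(labels), dtype=bool)
    # Exclude each anchor's similarity with itself.
    sim = np.where(self_mask, -np.inf, sim)
    # Numerically stable log-softmax over the other samples.
    sim_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - sim_max - np.log(
        np.exp(sim - sim_max).sum(axis=1, keepdims=True)
    )
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos.sum(axis=1)
    valid = pos_count > 0  # anchors with at least one positive
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1)
    return (per_anchor[valid] / pos_count[valid]).mean()
```

In a training loop under these assumptions, the two losses computed on the spatial and temporal embeddings would be added to the encoder's classification loss. Because the contrastive branch is only used for training, the encoder itself is unchanged at test time, which matches the claim in the abstract.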