Skeleton-based action recognition is a central task in human-computer interaction. However, most previous methods suffer from two issues: (i) semantic ambiguity arising from the mixing of spatial and temporal information; and (ii) failure to explicitly exploit the latent data distributions (i.e., the intra-class variations and inter-class relations), which leads to sub-optimal skeleton encoders. To mitigate these issues, we propose a spatial-temporal decoupling contrastive learning (STD-CL) framework that obtains discriminative and semantically distinct representations from skeleton sequences. The framework can be incorporated into various existing skeleton encoders and removed at test time. Specifically, we decouple the global features into spatial-specific and temporal-specific features to reduce their spatial-temporal coupling. Furthermore, to explicitly exploit the latent data distributions, we apply contrastive learning to the attentive features, modeling cross-sequence semantic relations by pulling together the features of positive pairs and pushing apart those of negative pairs. Extensive experiments show that STD-CL combined with four different skeleton encoders (HCN, 2S-AGCN, CTR-GCN, and Hyperformer) achieves solid improvements on the NTU60, NTU120, and NW-UCLA benchmarks. The code will be released soon.
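The two ideas described above, decoupling a global feature map into spatial-specific and temporal-specific parts and contrasting features across sequences, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a feature tensor of shape (N, C, T, V) (batch, channels, frames, joints), uses plain average pooling in place of the attentive features mentioned above, and uses a label-supervised InfoNCE-style loss with a temperature of 0.1. All function names and parameters here are hypothetical.

```python
import numpy as np


def decouple(features):
    """Split global features (N, C, T, V) into spatial and temporal parts.

    Assumption: average pooling stands in for the attentive pooling
    mentioned in the abstract.
    """
    spatial = features.mean(axis=2)   # (N, C, V): pool over frames, keep joints
    temporal = features.mean(axis=3)  # (N, C, T): pool over joints, keep frames
    return spatial, temporal


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def contrastive_loss(embeddings, labels, temperature=0.1):
    """Label-supervised InfoNCE-style loss over a batch of N >= 2 embeddings.

    Same-label samples are positives; all other samples are negatives.
    Anchors without any positive in the batch are skipped.
    """
    labels = np.asarray(labels)
    z = l2_normalize(np.asarray(embeddings, dtype=float))
    n = z.shape[0]
    self_mask = np.eye(n, dtype=bool)

    # Cosine similarities scaled by temperature; exclude self-similarity.
    sim = np.where(self_mask, -np.inf, z @ z.T / temperature)

    # Numerically stable log-softmax over the other samples in the batch.
    sim_max = sim.max(axis=1, keepdims=True)
    shifted = sim - sim_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0

    # Average log-probability assigned to the positives of each valid anchor.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())


def std_cl_loss(features, labels, temperature=0.1):
    """Sum of the contrastive losses on the spatial and temporal features."""
    spatial, temporal = decouple(features)
    n = features.shape[0]
    loss_s = contrastive_loss(spatial.reshape(n, -1), labels, temperature)
    loss_t = contrastive_loss(temporal.reshape(n, -1), labels, temperature)
    return loss_s + loss_t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 16, 10, 25))  # 8 sequences, 25 joints
    labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
    print("auxiliary loss:", std_cl_loss(feats, labels))
```

In a setup like this, the auxiliary loss would be added to the usual classification loss during training only, which is consistent with the statement that the framework can be removed at test time.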