Skeleton-based Action Recognition through Contrasting Two-Stream Spatial-Temporal Networks

To achieve accurate skeleton-based action recognition, most prior methods combine Graph Convolutional Networks (GCNs) with attention-based methods serially. However, they treat the human skeleton as a complete graph, which reduces the variation between different actions (e.g., the connection between the elbow and the head in the action "clapping hands"). To address this, we propose a novel Contrastive GCN-Transformer Network (ConGT) that fuses the spatial and temporal modules in parallel. ConGT consists of two parallel streams: a Spatial-Temporal Graph Convolution stream (STG) and a Spatial-Temporal Transformer stream (STT). The STG is designed to obtain action representations that preserve the natural topology of the human skeleton. The STT is designed to obtain action representations that capture the global relationships among joints. Since the action representations produced by the two streams have different characteristics, and each carries little information about the other, we introduce the contrastive learning paradigm to guide their output representations of the same sample to be as close as possible in a self-supervised manner. Through contrastive learning, the two streams learn from each other and enrich the action features by maximizing the mutual information between the two types of action representations. To further improve action recognition accuracy, we introduce the Cyclical Focal Loss (CFL), which focuses on confident training samples in the early training epochs and increasingly on hard samples during the middle epochs. Experiments on three benchmark datasets demonstrate that our model achieves state-of-the-art performance in action recognition.
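
The cross-stream contrastive objective described above can be illustrated with a small sketch. The abstract does not state which contrastive loss ConGT uses, so this example assumes a symmetric InfoNCE loss, a common choice that lower-bounds mutual information. The names `info_nce`, `cross_stream_loss`, and `temperature` are hypothetical, and the toy vectors stand in for real STG and STT outputs.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors, candidates, temperature=0.1):
    # Mean InfoNCE loss: anchors[i] is the positive match for candidates[i];
    # every other candidate in the batch acts as a negative.
    a = [l2_normalize(v) for v in anchors]
    c = [l2_normalize(v) for v in candidates]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, cj)) / temperature for cj in c]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)

def cross_stream_loss(stg_feats, stt_feats, temperature=0.1):
    # Symmetric loss (assumed form): each stream is matched against the other.
    forward = info_nce(stg_feats, stt_feats, temperature)
    backward = info_nce(stt_feats, stg_feats, temperature)
    return 0.5 * (forward + backward)

# Toy batch of two samples with 2-D features per stream (hypothetical values).
stg = [[1.0, 0.0], [0.0, 1.0]]
stt_aligned = [[0.9, 0.1], [0.1, 0.9]]   # same sample -> similar features
stt_swapped = [[0.1, 0.9], [0.9, 0.1]]   # same sample -> dissimilar features

loss_aligned = cross_stream_loss(stg, stt_aligned)
loss_swapped = cross_stream_loss(stg, stt_swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when the two streams agree on each sample and large when they do not, which is the behavior a training signal for pulling same-sample representations together would need.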

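
The epoch-dependent behavior attributed to the Cyclical Focal Loss can be sketched as a blend of two per-sample terms. The abstract gives no formula, so everything here is an assumption made for illustration: the form of `confident_term`, the linear schedule in `blend_weight`, and the parameters `gamma` and `cycle_factor` are hypothetical and may differ from the actual CFL definition. `p_t` denotes the predicted probability of the true class.

```python
import math

def focal_term(p_t, gamma=2.0):
    # Standard focal loss term: down-weights confident (high p_t) samples,
    # so hard samples dominate.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def confident_term(p_t, gamma=2.0):
    # Assumed counterpart: the (1 + p_t) factor gives confident samples
    # relatively more weight than the focal term does.
    return -((1.0 + p_t) ** gamma) * math.log(p_t)

def blend_weight(epoch, total_epochs, cycle_factor=2.0):
    # Assumed linear schedule: 1 at the start, 0 at total_epochs / cycle_factor,
    # rising back to 1 at the final epoch (requires cycle_factor > 1).
    progress = cycle_factor * epoch / total_epochs
    if progress <= 1.0:
        return 1.0 - progress
    return (progress - 1.0) / (cycle_factor - 1.0)

def cyclical_focal_loss(p_t, epoch, total_epochs, gamma=2.0, cycle_factor=2.0):
    # Blend the two terms according to the training phase.
    xi = blend_weight(epoch, total_epochs, cycle_factor)
    return xi * confident_term(p_t, gamma) + (1.0 - xi) * focal_term(p_t, gamma)

# Relative weight of an easy sample (p_t = 0.9) versus a hard one (p_t = 0.3).
early = cyclical_focal_loss(0.9, 0, 100) / cyclical_focal_loss(0.3, 0, 100)
middle = cyclical_focal_loss(0.9, 50, 100) / cyclical_focal_loss(0.3, 50, 100)
print(early > middle)
```

Under these assumptions, the easy sample contributes a larger share of the loss at epoch 0 than at the midpoint, where the loss reduces to the plain focal term and hard samples dominate.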