Graph convolutional networks (GCNs) have been widely used in skeleton-based action recognition and have achieved remarkable results. We argue that the key to skeleton-based action recognition is how the skeleton changes across frames, so we focus on how graph convolutional networks learn different topologies and effectively aggregate joint features over global and local temporal ranges. In this work, we propose three channel-wise topology graph convolutions based on Channel-wise Topology Refinement Graph Convolution (CTR-GCN). Combining CTR-GCN with two joint cross-attention modules captures skeleton features that describe the upper-lower body-part and hand-foot relationships. To capture how human skeletons change across frames, we then design Temporal Attention Transformers, which learn the temporal features of human skeleton sequences. Finally, we fuse the output scales of the temporal features with an MLP and perform classification. The resulting graph convolutional network, named Spatial Temporal Effective Body-part Cross Attention Transformer (STEP-CATFormer), achieves notably high performance on the NTU RGB+D and NTU RGB+D 120 datasets. Our code and models are available at https://github.com/maclong01/STEP-CATFormer
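To make the idea of joint cross-attention between body parts concrete, the following is a minimal, dependency-free sketch of scaled dot-product cross-attention in which one group of joints (for example, the upper body) attends to another group (for example, the lower body). It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy feature values, and the omission of learned query/key/value projections are all choices made here for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_joints, context_joints):
    """Scaled dot-product cross-attention without learned projections.

    query_joints:   feature vectors for one joint group (e.g. upper body)
    context_joints: feature vectors for the other group (e.g. lower body)
    Returns one attended vector per query joint, plus the attention weights.
    """
    dim = len(query_joints[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in query_joints:
        # Similarity of this query joint to every context joint.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context_joints]
        w = softmax(scores)
        # Weighted average of the context joint features, channel by channel.
        attended = [
            sum(w_j * v[d] for w_j, v in zip(w, context_joints)) for d in range(dim)
        ]
        outputs.append(attended)
        weights.append(w)
    return outputs, weights


# Hypothetical toy features: 2 upper-body joints, 3 lower-body joints, 4 channels.
upper = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.3, 0.1]]
lower = [[0.9, 0.1, 0.4, 0.0], [0.1, 0.8, 0.2, 0.3], [0.5, 0.5, 0.5, 0.5]]
attended, attn = cross_attention(upper, lower)
```

Each row of `attn` sums to one, so every attended vector is a convex combination of the context joints' features. A trained model would normally apply learned linear projections before the dot product and run the attention in both directions (upper to lower and lower to upper); both are left out of this sketch.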