Skeleton-based gesture recognition methods have achieved strong results using Graph Convolutional Networks (GCNs). In addition, context-dependent adaptive topology, which supplies neighborhood vertex information, and attention mechanisms help a model represent actions better. In this paper, we propose a hybrid self-attention and GCN model, the Multi-Scale Spatial-Temporal self-attention GCN (MSST-GCN), which improves modeling ability and achieves state-of-the-art results on several datasets. We use a spatial self-attention module with adaptive topology to capture intra-frame interactions among different body parts, and a temporal self-attention module to examine correlations between frames for each node. These two modules are followed by a multi-scale convolution network with dilations, which captures not only the long-range temporal dependencies of joints but also the long-range spatial dependencies (i.e., long-distance dependencies) of node temporal behaviors. The resulting features are combined into high-level spatial-temporal representations, and a softmax classifier outputs the predicted action.
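The pipeline described above (spatial self-attention with a topology term, temporal self-attention, multi-scale dilated convolution, softmax classification) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: the function and parameter names, the tensor layout (frames, joints, channels), adding the adaptive topology as a learnable bias on the spatial attention logits, averaging the dilated branches, the dilation rates (1, 2, 4), and the random weights and sizes.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv, bias=None):
    """Scaled dot-product self-attention over the second-to-last axis.
    x: (..., N, C). An optional (N, N) bias is added to the logits."""
    q, k, v = x @ wq, x @ wk, x @ wv
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    if bias is not None:
        logits = logits + bias
    return softmax(logits, axis=-1) @ v

def dilated_temporal_conv(x, kernel, dilation):
    """Convolution along time with zero 'same' padding.
    x: (T, V, C_in); kernel: (K, C_in, C_out) with K odd."""
    K, T = kernel.shape[0], x.shape[0]
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0)))
    out = np.zeros((T, x.shape[1], kernel.shape[2]))
    for k in range(K):
        out += xp[k * dilation : k * dilation + T] @ kernel[k]
    return out

def msst_sketch(x, params, dilations=(1, 2, 4)):
    """x: (T, V, C) skeleton features. Returns class probabilities."""
    # Spatial attention: joints attend to joints within each frame.
    # ASSUMPTION: the adaptive topology enters as a learnable (V, V) bias.
    xs = self_attention(x, *params["spatial"], bias=params["topology"])
    # Temporal attention: frames attend to frames for each joint.
    xt = self_attention(np.swapaxes(xs, 0, 1), *params["temporal"])
    xt = np.swapaxes(xt, 0, 1)
    # Multi-scale branch: one dilated convolution per rate, then averaged.
    # ASSUMPTION: averaging is only one possible way to fuse the branches.
    branches = [dilated_temporal_conv(xt, k, d)
                for k, d in zip(params["conv"], dilations)]
    feat = np.maximum(np.mean(branches, axis=0), 0.0)  # ReLU
    pooled = feat.mean(axis=(0, 1))  # global average over frames and joints
    return softmax(pooled @ params["classifier"])

def init_params(rng, V, C, num_classes, num_scales=3, K=3):
    # Hypothetical random initialization, for illustration only.
    w = lambda *shape: rng.standard_normal(shape) * 0.1
    return {
        "spatial": (w(C, C), w(C, C), w(C, C)),
        "temporal": (w(C, C), w(C, C), w(C, C)),
        "topology": w(V, V),
        "conv": [w(K, C, C) for _ in range(num_scales)],
        "classifier": w(C, num_classes),
    }

# Arbitrary illustrative sizes: 16 frames, 22 joints, 8 channels, 14 classes.
rng = np.random.default_rng(0)
T, V, C, num_classes = 16, 22, 8, 14
x = rng.standard_normal((T, V, C))
probs = msst_sketch(x, init_params(rng, V, C, num_classes))
```

Here `probs` is a vector of length `num_classes` that sums to 1. A practical model would stack several such blocks, add normalization and residual connections, and train the weights. None of those details are specified in the text above.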