Skeleton-based action recognition, which classifies human actions from the coordinates of joints and their connectivity in skeleton data, is widely used in a variety of scenarios. While Graph Convolutional Networks (GCNs) have been proposed for skeleton data represented as graphs, their receptive fields are limited by joint connectivity. To address this limitation, recent work has introduced transformer-based methods. However, capturing correlations between all joints in all frames requires substantial memory. To alleviate this, we propose a novel approach called Skeletal-Temporal Transformer (SkateFormer) that partitions joints and frames based on different types of skeletal-temporal relations (Skate-Type) and performs skeletal-temporal self-attention (Skate-MSA) within each partition. We categorize the key skeletal-temporal relations for action recognition into four distinct types. These types combine (i) two skeletal relation types based on physically neighboring and distant joints, and (ii) two temporal relation types based on neighboring and distant frames. Through this partition-specific attention strategy, our SkateFormer can selectively focus on the joints and frames that are crucial for action recognition in an action-adaptive manner while keeping computation efficient. Extensive experiments on various benchmark datasets validate that our SkateFormer outperforms recent state-of-the-art methods.
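To make the partition-specific attention idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that "neighboring" groups are contiguous index blocks and "distant" groups are evenly strided indices, that all groups have equal size, and that attention is single-head with no learned projections. The helper names (`partition_indices`, `partitioned_self_attention`, `skate_type_outputs`) and the toy sizes are hypothetical; in a real skeleton, neighboring joints would follow body-part definitions rather than index order.

```python
import itertools
import numpy as np

def partition_indices(n, size, mode):
    """Split range(n) into groups of `size` indices (assumes n % size == 0).

    'neighbor': contiguous blocks, e.g. [0,1,2,3], [4,5,6,7]
    'distant' : evenly strided,    e.g. [0,2,4,6], [1,3,5,7]
    """
    if n % size != 0:
        raise ValueError("n must be divisible by size")
    idx = np.arange(n)
    if mode == "neighbor":
        return idx.reshape(n // size, size)
    if mode == "distant":
        return idx.reshape(size, n // size).T
    raise ValueError(f"unknown mode: {mode}")

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def partitioned_self_attention(x, frame_groups, joint_groups):
    """Self-attention restricted to each (frame group, joint group) partition.

    x has shape (T, V, C): frames, joints, channels.
    """
    C = x.shape[-1]
    out = np.empty_like(x)
    for f in frame_groups:
        for j in joint_groups:
            block = x[np.ix_(f, j)]                  # (len(f), len(j), C)
            tokens = block.reshape(-1, C)            # one token per (frame, joint)
            attn = softmax(tokens @ tokens.T / np.sqrt(C))
            out[np.ix_(f, j)] = (attn @ tokens).reshape(block.shape)
    return out

def skate_type_outputs(x, frame_size, joint_size):
    """Run partitioned attention for all four (joint mode, frame mode) types."""
    T, V, _ = x.shape
    outputs = {}
    for joint_mode, frame_mode in itertools.product(("neighbor", "distant"), repeat=2):
        joint_groups = partition_indices(V, joint_size, joint_mode)
        frame_groups = partition_indices(T, frame_size, frame_mode)
        outputs[(joint_mode, frame_mode)] = partitioned_self_attention(
            x, frame_groups, joint_groups
        )
    return outputs

# Toy input: 16 frames, 8 joints, 4 channels (sizes chosen only for illustration).
T, V, C = 16, 8, 4
frame_size, joint_size = 4, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((T, V, C))
outputs = skate_type_outputs(x, frame_size, joint_size)

# Attention-score entries: full attention vs. one partitioned type.
full_scores = (T * V) ** 2
n_partitions = (T // frame_size) * (V // joint_size)
partition_scores = n_partitions * (frame_size * joint_size) ** 2
print(full_scores, partition_scores)
```

With these toy sizes, full attention over all 128 joint-frame tokens needs 16384 score entries, while each partitioned type needs 2048, which illustrates where the memory saving described above comes from. How the four types are combined afterwards is left out here, since the abstract does not specify it.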