Because skeleton data offer compact, rich high-level representations, skeleton-based human action recognition has recently become a highly active research topic. Previous studies have shown that modeling joint relationships along the spatial and temporal dimensions provides information critical to action recognition. However, effectively encoding global dependencies among joints during spatio-temporal feature extraction remains challenging. In this paper, we introduce Action Capsule, which identifies action-related key joints by considering the latent correlations among joints in a skeleton sequence. We show that, during inference, our end-to-end network attends to a set of joints specific to each action, whose encoded spatio-temporal features are aggregated to recognize the action. Additionally, using multiple stages of action capsules enhances the network's ability to distinguish similar actions. As a result, our network outperforms state-of-the-art approaches on the N-UCLA dataset and obtains competitive results on the NTURGBD dataset, while requiring significantly less computation as measured in GFLOPs.