Skeleton-based action recognition has recently received considerable attention. Current approaches typically formulate the task as one-hot classification and do not fully exploit the semantic relations between actions. For example, "make victory sign" and "thumb up" are both hand gestures whose major difference lies in the movement of the hands. This information is absent from the categorical one-hot encoding of action classes but can be recovered from a description of the action. Utilizing action descriptions in training could therefore benefit representation learning. In this work, we propose a Generative Action-description Prompts (GAP) approach for skeleton-based action recognition. More specifically, we employ a pre-trained large-scale language model as a knowledge engine to automatically generate text descriptions of the body-part movements of actions, and we propose a multi-modal training scheme in which a text encoder generates feature vectors for different body parts that supervise the skeleton encoder during action representation learning. Experiments show that GAP achieves noticeable improvements over various baseline models without extra computation cost at inference. GAP achieves new state-of-the-art results on popular skeleton-based action recognition benchmarks, including NTU RGB+D, NTU RGB+D 120 and NW-UCLA. The source code is available at https://github.com/MartinXM/GAP.
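The multi-modal training scheme described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes a simple one-directional softmax alignment between skeleton part features and text features, and every name in it (`part_text_loss`, `total_loss`, `temperature`, `weight`) is hypothetical. Feature vectors are plain Python lists so that the example stays self-contained; in practice they would come from a skeleton encoder and a pre-trained text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample, using a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def part_text_loss(part_feats, text_feats, label, temperature=0.1):
    """Align each body-part feature with the text feature of its true class.

    part_feats: {part_name: vector} from the skeleton encoder (one sample).
    text_feats: {part_name: [vector for each class]} from the text encoder,
                computed from generated body-part descriptions.
    The softmax form and the temperature are assumptions of this sketch.
    """
    losses = []
    for part, feat in part_feats.items():
        logits = [cosine(feat, t) / temperature for t in text_feats[part]]
        losses.append(cross_entropy(logits, label))
    return sum(losses) / len(losses)


def total_loss(class_logits, part_feats, text_feats, label, weight=1.0):
    """Classification loss plus the text-supervised alignment term."""
    cls = cross_entropy(class_logits, label)
    return cls + weight * part_text_loss(part_feats, text_feats, label)


# Toy usage: two classes, two body parts, 2-D features.
text_feats = {
    "hand": [[1.0, 0.0], [0.0, 1.0]],
    "head": [[1.0, 0.0], [0.0, 1.0]],
}
part_feats = {"hand": [0.9, 0.1], "head": [0.8, 0.2]}
loss = total_loss([2.0, 0.5], part_feats, text_feats, label=0)
```

In this sketch the text features enter only through the training loss. At inference, a model trained this way would use `class_logits` alone, which is consistent with the claim that the text branch adds no computation cost at test time.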