Action recognition has long been a fundamental and intriguing problem in artificial intelligence. The task is challenging because actions are high-dimensional and subtle motion details must be taken into account. Current state-of-the-art approaches typically learn from articulated motion sequences directly in 3D Euclidean space. However, vanilla Euclidean space is not efficient for modeling important motion characteristics such as joint-wise angular acceleration, which reveals the driving force behind the motion. Moreover, current methods typically attend to each channel equally and lack theoretical constraints on extracting task-relevant features from the input. In this paper, we seek to tackle these challenges from three aspects: (1) We propose to incorporate an acceleration representation that explicitly models the higher-order variations in motion. (2) We introduce a novel Stream-GCN network equipped with multi-stream components and channel attention, where the different representations (i.e., streams) supplement each other toward more precise action recognition, while attention emphasizes the important channels. (3) We explore feature-level supervision to maximize the extraction of task-relevant information and formulate it as a mutual information loss. Empirically, our approach sets a new state of the art on three benchmark datasets: NTU RGB+D, NTU RGB+D 120, and NW-UCLA. Our code is released anonymously at https://github.com/ActionR-Group/Stream-GCN, in the hope of inspiring the community.
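To make the notion of higher-order variation in aspect (1) concrete, the following minimal sketch computes a velocity and an acceleration stream from a skeleton sequence by finite differences over time. It is an illustration under stated assumptions, not the Stream-GCN implementation: it differences plain Cartesian joint coordinates rather than joint angles, assumes a unit time step between frames, and the names `temporal_difference` and `acceleration` are hypothetical.

```python
from typing import List, Tuple

Joint = Tuple[float, ...]   # one joint: its (x, y, z) coordinates
Frame = List[Joint]         # one frame: all joints of the skeleton


def _zero_frame(frame: Frame) -> Frame:
    """Return a frame of the same shape filled with zeros."""
    return [tuple(0.0 for _ in joint) for joint in frame]


def temporal_difference(frames: List[Frame]) -> List[Frame]:
    """First-order difference x[t] - x[t-1], zero-padded at t = 0.

    Assumption: consecutive frames are one time unit apart.
    """
    if not frames:
        return []
    diffs = [_zero_frame(frames[0])]
    for prev, curr in zip(frames, frames[1:]):
        diffs.append(
            [tuple(c - p for c, p in zip(cj, pj)) for cj, pj in zip(curr, prev)]
        )
    return diffs


def acceleration(frames: List[Frame]) -> List[Frame]:
    """Second-order difference x[t] - 2*x[t-1] + x[t-2].

    The first two frames have no valid second difference and are set to zero.
    """
    velocity = temporal_difference(frames)
    acc = temporal_difference(velocity)
    if len(acc) > 1:
        # acc[1] would otherwise equal velocity[1], an artifact of zero padding.
        acc[1] = _zero_frame(frames[0])
    return acc


# Usage: a single joint moving along x with position t^2 (constant acceleration).
sequence = [[(float(t * t), 0.0, 0.0)] for t in range(5)]
print([frame[0][0] for frame in acceleration(sequence)[2:]])  # [2.0, 2.0, 2.0]
```

In a multi-stream setup, each such representation would be fed to its own branch and the branch predictions combined; how the streams are actually built and fused in Stream-GCN is specified by the paper and its released code, not by this sketch.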