In this work we study the benefits of using tracking and 3D poses for action recognition. To achieve this, we take the Lagrangian view of analysing actions over a trajectory of human motion rather than at a fixed point in space. Taking this stance allows us to use the tracklets of people to predict their actions. In this spirit, we first show the benefits of using 3D pose to infer actions and study person-person interactions. Subsequently, we propose a Lagrangian Action Recognition model that fuses 3D pose and contextualized appearance over tracklets. Our method achieves state-of-the-art performance on the AVA v2.2 dataset in both the pose-only setting and the standard benchmark setting. When reasoning about actions using only pose cues, our pose model achieves a gain of +10.0 mAP over the corresponding state of the art, while our fused model achieves a gain of +2.8 mAP over the best state-of-the-art model. Code and results are available at: https://brjathu.github.io/LART