Action recognition is an important problem that requires identifying actions in video by learning complex interactions among scene actors and objects. However, modern deep-learning-based networks often require significant computation, and may capture scene context using multiple modalities, which further increases compute costs. Efficient methods, such as those used for AR/VR, often use only human-keypoint information but lose scene context, which hurts accuracy. In this paper, we describe an action-localization method, KeyNet, that uses only keypoint data for tracking and action recognition. Specifically, KeyNet introduces the use of object-based keypoint information to capture context in the scene. Our method illustrates how to build a structured intermediate representation that allows modeling higher-order interactions in the scene from object and human keypoints without using any RGB information. We find that KeyNet is able to track and classify human actions at just 5 FPS. More importantly, we demonstrate on the AVA action and Kinetics datasets that modeling object keypoints can recover the context lost by relying on keypoint information alone.
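To make the idea of a keypoint-only intermediate representation concrete, the following is a minimal sketch, not the KeyNet implementation. It assumes human keypoints arrive as (x, y, confidence) triples and that each object is reduced to five keypoints (the four corners and the center of its bounding box). The token layout, the `HUMAN`/`OBJECT` type ids, and the function names `box_to_keypoints` and `build_tokens` are all hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical token layout (an assumption, not taken from the paper):
# (x_norm, y_norm, confidence, type_id, instance_id, frame_idx)
HUMAN, OBJECT = 0, 1


def box_to_keypoints(box):
    """Assumed object keypoints: four corners and center of (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return np.array(
        [[x1, y1], [x2, y1], [x2, y2], [x1, y2], [cx, cy]], dtype=np.float32
    )


def build_tokens(human_kpts, object_boxes, frame_idx, width, height):
    """Flatten one frame into a token array with no RGB input.

    human_kpts:   (P, J, 3) array of x, y, confidence per joint.
    object_boxes: (O, 4) array of x1, y1, x2, y2 in pixels.
    Returns an (N, 6) float32 array, N = P * J + O * 5.
    """
    scale = np.array([width, height], dtype=np.float32)
    tokens = []
    for pid, person in enumerate(human_kpts):
        for x, y, c in person:
            tokens.append([x / width, y / height, c, HUMAN, pid, frame_idx])
    for oid, box in enumerate(object_boxes):
        # Object keypoints carry no detector confidence here; 1.0 is a placeholder.
        for x, y in box_to_keypoints(box) / scale:
            tokens.append([x, y, 1.0, OBJECT, oid, frame_idx])
    return np.asarray(tokens, dtype=np.float32).reshape(-1, 6)
```

Under these assumptions, a frame with two people (17 joints each) and one object yields 39 tokens. A sequence model could then attend over tokens from several frames to capture human-object interactions, which is the kind of higher-order modeling the abstract refers to.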