Unified Keypoint-based Action Recognition Framework via Structured Keypoint Pooling

This paper simultaneously addresses three limitations of conventional skeleton-based action recognition: skeleton detection and tracking errors, the limited variety of targeted actions, and person-wise and frame-wise action recognition. A point cloud deep-learning paradigm is introduced to action recognition, and a unified framework is proposed along with a novel deep neural network architecture called Structured Keypoint Pooling. The proposed method sparsely aggregates keypoint features in a cascaded manner based on prior knowledge of the data structure inherent in skeletons, such as the instances and frames to which each keypoint belongs, and thereby achieves robustness against input errors. Its less constrained, tracking-free architecture enables time-series keypoints consisting of human skeletons and nonhuman object contours to be treated efficiently as an input 3D point cloud, which extends the variety of targeted actions. Furthermore, we propose a Pooling-Switching Trick inspired by Structured Keypoint Pooling. This trick switches the pooling kernels between the training and inference phases to detect person-wise and frame-wise actions in a weakly supervised manner using only video-level action labels. It also enables our training scheme to naturally introduce a novel data augmentation that mixes multiple point clouds extracted from different videos. In the experiments, we comprehensively verify the effectiveness of the proposed method against these limitations, and the method outperforms state-of-the-art skeleton-based action recognition and spatio-temporal action localization methods.
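To make the pooling idea concrete, the following is a minimal sketch in plain Python and is not the architecture from the paper. It assumes each keypoint carries a frame index, an instance index, and a toy feature vector. It pools first per (frame, instance) and then over a scope that can be switched between the whole video, each person, or each frame. The names `pool`, `grouped_max_pool`, and `mix_point_clouds` are hypothetical, and max pooling with no learned layers between the stages is a simplifying assumption.

```python
from collections import defaultdict

# Each keypoint is (frame, instance, feature). The feature lists are toy
# stand-ins for learned point-wise features; all names here are hypothetical.

def max_pool(vectors):
    """Element-wise max over a non-empty list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]

def grouped_max_pool(keypoints, key):
    """Max-pool features within each group given by key(frame, instance)."""
    groups = defaultdict(list)
    for frame, instance, feature in keypoints:
        groups[key(frame, instance)].append(feature)
    return {group: max_pool(features) for group, features in groups.items()}

def pool(keypoints, scope="video"):
    """Cascaded pooling whose final kernel depends on `scope`.

    Stage one always pools per (frame, instance). Stage two pools those
    local features over the whole video, per instance, or per frame.
    """
    local = grouped_max_pool(keypoints, key=lambda f, i: (f, i))
    local_points = [(f, i, feat) for (f, i), feat in local.items()]
    if scope == "video":
        return max_pool([feat for _, _, feat in local_points])
    if scope == "instance":
        return grouped_max_pool(local_points, key=lambda f, i: i)
    if scope == "frame":
        return grouped_max_pool(local_points, key=lambda f, i: f)
    raise ValueError(f"unknown scope: {scope}")

def mix_point_clouds(cloud_a, cloud_b):
    """Merge two clouds, shifting instance ids of cloud_b to avoid collisions."""
    offset = 1 + max(instance for _, instance, _ in cloud_a)
    shifted = [(f, i + offset, feat) for f, i, feat in cloud_b]
    return cloud_a + shifted

# Two people over two frames; person 1 is undetected in frame 1, which the
# set-style input tolerates because no fixed tensor slot has to be filled.
cloud = [
    (0, 0, [0.1, 0.9]), (0, 0, [0.4, 0.2]),
    (0, 1, [0.7, 0.3]),
    (1, 0, [0.5, 0.5]),
]

video_feature = pool(cloud, scope="video")       # training-style: one per video
person_features = pool(cloud, scope="instance")  # inference-style: one per person
frame_features = pool(cloud, scope="frame")      # inference-style: one per frame
```

In this sketch, only the final pooling scope changes between the video-level and the person-wise or frame-wise outputs, which is the sense in which a classifier trained on video-level labels could be reused for finer-grained predictions. Mixing amounts to concatenating two keypoint sets, since the input is an unordered point cloud rather than a tracked, fixed-size skeleton tensor.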
