In 3D human action recognition, limited supervised data makes it difficult to exploit the full modeling capacity of powerful networks such as transformers. As a result, researchers have been actively investigating effective self-supervised pre-training strategies. In this work, we show that, rather than following the prevalent pretext task of masked self-component reconstruction of human joints, explicit contextual motion modeling is key to learning effective feature representations for 3D action recognition. Specifically, we propose the Masked Motion Prediction (MAMP) framework. MAMP takes a masked spatio-temporal skeleton sequence as input and predicts the corresponding temporal motion of the masked human joints. Given the high temporal redundancy of skeleton sequences, the motion information in MAMP also acts as an empirical semantic richness prior that guides the masking process, directing attention to semantically rich temporal regions. Extensive experiments on the NTU-60, NTU-120, and PKU-MMD datasets show that MAMP pre-training substantially improves the performance of the adopted vanilla transformer, achieving state-of-the-art results without bells and whistles. The source code of MAMP is available at https://github.com/maoyunyao/MAMP.
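The two ideas in the abstract, predicting the temporal motion of masked joints and using motion as a prior for where to mask, can be illustrated with a minimal pure-Python sketch. It is not the released implementation: the function names, the `stride`, `mask_ratio`, and `temperature` parameters, the clamped padding at the end of the sequence, the L1 motion intensity, and the Gumbel-top-k sampling are all assumptions made here for illustration.

```python
import math
import random


def temporal_motion(seq, stride=1):
    """Displacement of every joint over `stride` frames.

    seq[t][v] is an (x, y, z) tuple. Frame indices past the end are clamped
    to the last frame (an assumption of this sketch), so the final frame
    always has zero motion.
    """
    last = len(seq) - 1
    motion = []
    for t, frame in enumerate(seq):
        future = seq[min(t + stride, last)]
        motion.append([tuple(b - a for a, b in zip(cur, fut))
                       for cur, fut in zip(frame, future)])
    return motion


def motion_intensity(motion):
    """L1 magnitude of the motion vector at each (frame, joint) position."""
    return [[sum(abs(c) for c in vec) for vec in frame] for frame in motion]


def motion_aware_mask(intensity, mask_ratio=0.5, temperature=1.0, rng=random):
    """Choose positions to mask, favouring positions with larger motion.

    Adds Gumbel noise to intensity / temperature and keeps the top-k keys,
    which samples k positions without replacement from
    softmax(intensity / temperature). Returns a set of (t, v) pairs.
    """
    keys = []
    for t, row in enumerate(intensity):
        for v, value in enumerate(row):
            u = max(rng.random(), 1e-12)          # avoid log(0)
            gumbel = -math.log(-math.log(u))
            keys.append((value / temperature + gumbel, (t, v)))
    k = round(mask_ratio * len(keys))
    keys.sort(key=lambda item: item[0], reverse=True)
    return {pos for _, pos in keys[:k]}


def masked_motion_loss(pred, target, masked):
    """Mean squared error between predicted and true motion, masked positions only."""
    total, count = 0.0, 0
    for t, v in masked:
        for p, q in zip(pred[t][v], target[t][v]):
            total += (p - q) ** 2
            count += 1
    return total / count if count else 0.0


# Toy sequence: 4 frames, 2 joints. Joint 0 is static, joint 1 moves along x.
seq = [[(0.0, 0.0, 0.0), (float(t), 0.0, 0.0)] for t in range(4)]
target = temporal_motion(seq)
masked = motion_aware_mask(motion_intensity(target), mask_ratio=0.5)
# A real encoder would see only the unmasked joints; here the "prediction"
# is a zero-motion placeholder standing in for a network output.
pred = [[(0.0, 0.0, 0.0) for _ in frame] for frame in seq]
loss = masked_motion_loss(pred, target, masked)
```

In this sketch a lower `temperature` concentrates the mask on high-motion positions, while a very high one approaches uniform random masking; the loss is computed only on masked positions, so the model is never rewarded for copying visible joints.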