Masked skeleton reconstruction models have recently emerged as strong action representation learners, driving significant progress in self-supervised skeleton-based action recognition. However, existing state-of-the-art methods must predict a very large number of spatiotemporal patches, which significantly prolongs training. Moreover, because they treat all spatiotemporal regions equally during reconstruction, these models are distracted from the critical motion patterns that underlie action semantics. To address both problems, we propose Adaptive Masked Reconstruction (AMR), a faster and stronger pre-training framework. We first decouple the decoder from the encoder, which allows the decoder to predict larger spatiotemporal patches and dramatically reduces reconstruction complexity. Larger patches, however, contain more complex information that is harder to predict, which degrades performance; we therefore introduce an adaptive guidance module. This module identifies regions of high motion informativeness and guides the model to focus on the most discriminative parts of each patch, easing the reconstruction task. Experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets show that AMR both substantially accelerates pre-training and improves downstream recognition accuracy, surpassing current state-of-the-art approaches.
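The abstract does not say how the adaptive guidance module measures motion informativeness, so the following is only a minimal sketch of the general idea, not the AMR implementation. It assumes that informativeness can be approximated by frame-to-frame joint displacement, that the weights come from a softmax over all positions in a patch, and that the loss is a weighted squared error. The function names, the `temperature` parameter, and the toy data are all hypothetical.

```python
import math

# Illustrative sketch only: the displacement-based score, the softmax
# weighting, and every name below are assumptions, not the paper's method.

def motion_magnitude(seq):
    """Per-frame, per-joint displacement norm; the first frame gets 0."""
    mags = [[0.0] * len(seq[0])]
    for prev, cur in zip(seq, seq[1:]):
        mags.append([math.dist(p, c) for p, c in zip(prev, cur)])
    return mags

def softmax_weights(mags, temperature=1.0):
    """Softmax over every (frame, joint) position, so the weights sum to 1."""
    flat = [m / temperature for row in mags for m in row]
    peak = max(flat)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in flat]
    total = sum(exps)
    n_joints = len(mags[0])
    weights = [e / total for e in exps]
    return [weights[i:i + n_joints] for i in range(0, len(weights), n_joints)]

def weighted_reconstruction_loss(pred, target, weights):
    """Squared error per joint, scaled by its motion weight, summed over the patch."""
    loss = 0.0
    for p_frame, t_frame, w_frame in zip(pred, target, weights):
        for p, t, w in zip(p_frame, t_frame, w_frame):
            loss += w * math.dist(p, t) ** 2
    return loss

# Toy patch: 3 frames, 2 joints. Joint 0 is static, joint 1 moves along x.
seq = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
]
weights = softmax_weights(motion_magnitude(seq))
```

In this toy patch the moving joint receives a larger weight than the static one, so errors on it contribute more to the loss. That is the sense in which a guidance signal could steer reconstruction toward the most discriminative parts of a large patch.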