The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level action recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely partitions skeleton sequences into non-overlapping snippets and uses contrastive learning to promote features that distinguish these snippets across videos. Additionally, we build on the strong backbones of skeleton-based action recognition models, fusing their intermediate features with a U-shaped module to enhance temporal feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD when pretraining on NTU RGB+D and BABEL.
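For concreteness, a snippet-level contrastive objective of this kind can be sketched as a dense InfoNCE loss: each sequence is split into non-overlapping snippets, two augmented views are encoded, and the snippet at the same position in the other view serves as the positive while all other snippets in the batch act as negatives. The sketch below is a minimal illustration under those assumptions; the function name, tensor shapes, and temperature value are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def snippet_infonce(z_q, z_k, temperature=0.07):
    """Dense snippet-level InfoNCE (illustrative sketch, not the paper's exact loss).

    z_q, z_k: (B, T, D) snippet embeddings from two augmented views of B
    skeleton sequences, each split into T non-overlapping snippets.
    The matching (sequence, snippet) pair across views is the positive;
    every other snippet in the batch is treated as a negative.
    """
    B, T, D = z_q.shape
    q = F.normalize(z_q.reshape(B * T, D), dim=1)   # L2-normalize query snippets
    k = F.normalize(z_k.reshape(B * T, D), dim=1)   # L2-normalize key snippets
    logits = q @ k.t() / temperature                # (B*T, B*T) cosine similarities
    targets = torch.arange(B * T, device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```

Contrasting snippets rather than whole sequences is what gives the loss its temporal sensitivity: with B sequences of T snippets each, every snippet must be separated from the other B*T - 1 snippets in the batch, including its own temporal neighbors, rather than only from other videos.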