Recent methods for point cloud video understanding have made substantial progress by learning from large amounts of labeled data. However, annotating point cloud videos is notoriously expensive. Moreover, training on one or only a few traditional tasks (e.g., classification) may be insufficient to learn the subtle details of the spatio-temporal structure of point cloud videos. In this paper, we propose a Masked Spatio-Temporal Structure Prediction (MaST-Pre) method to capture the structure of point cloud videos without human annotations. MaST-Pre is based on spatio-temporal point-tube masking and consists of two self-supervised learning tasks. First, by reconstructing masked point tubes, our method captures the appearance information of point cloud videos. Second, to learn motion, we propose a temporal cardinality difference prediction task that estimates the change in the number of points within a point tube. In this way, MaST-Pre is forced to model both the spatial and the temporal structure of point cloud videos. Extensive experiments on MSRAction-3D, NTU-RGBD, NvGesture, and SHREC'17 demonstrate the effectiveness of the proposed method.
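To make the temporal cardinality difference idea concrete, the following is a minimal sketch under simplifying assumptions, not the implementation used by MaST-Pre. It assumes a point tube can be approximated by a fixed spherical neighborhood (a hypothetical `center` and `radius`) tracked over consecutive frames, and that the target is the raw frame-to-frame change in point count. The helper names `tube_cardinalities` and `temporal_cardinality_difference` are illustrative only.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def tube_cardinalities(
    frames: Sequence[Sequence[Point]], center: Point, radius: float
) -> List[int]:
    """Count, for each frame, the points within `radius` of `center`.

    Assumption: the tube is a fixed sphere repeated across frames.
    """
    return [
        sum(1 for p in frame if math.dist(p, center) <= radius)
        for frame in frames
    ]


def temporal_cardinality_difference(counts: Sequence[int]) -> List[int]:
    """Change in point count between each pair of consecutive frames."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


# Toy point cloud video: three frames of three points each.
frames = [
    [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 0.1, 0.0), (0.2, 0.0, 0.0), (0.0, 0.0, 0.3)],
]

counts = tube_cardinalities(frames, center=(0.0, 0.0, 0.0), radius=0.5)
diffs = temporal_cardinality_difference(counts)
print(counts)  # [2, 1, 3]
print(diffs)   # [-1, 2]
```

In this toy case a point leaves the neighborhood between the first and second frames and two points enter it between the second and third, so the differences carry a coarse motion signal without requiring point correspondences across frames.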