We propose a unified self-supervised learning framework for point cloud videos that handles both object-centric and scene-centric data. Previous methods typically learn representations at the clip or frame level and therefore struggle to capture fine-grained semantics. Instead of contrasting clip or frame representations, our framework conducts contrastive learning at the point level. We also introduce a new pretext task, semantic alignment of superpoints, which helps the representations capture semantic cues at multiple scales. Because dynamic point clouds are highly redundant along the temporal dimension, conducting contrastive learning directly at the point level usually produces a large number of undesired negatives and models positive representations insufficiently. To remedy this, we propose a selection strategy that retains proper negatives and uses high-similarity samples from other instances as supplementary positives. Extensive experiments show that our method outperforms supervised counterparts on a wide range of downstream tasks and demonstrate the superior transferability of the learned representations.
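The selection strategy described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract gives no formulas, so the InfoNCE-style loss, the cosine-similarity criterion, the function name `point_contrastive_loss`, and the thresholds `neg_max_sim` and `pos_min_sim` are all assumptions chosen for illustration. The sketch reads "retaining proper negatives" as dropping candidate negatives that are nearly identical to the anchor, and "positive supplements" as promoting very similar points from other instances to positives.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def point_contrastive_loss(anchors, targets, bank, tau=0.1,
                           neg_max_sim=0.9, pos_min_sim=0.95):
    """Point-level InfoNCE-style loss with sample selection (illustrative).

    anchors[i] and targets[i] are features of the same point in two views.
    bank holds point features taken from other instances.
    All thresholds are hypothetical values, not taken from the paper.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        positives = [cosine(anchor, targets[i])]
        negatives = []

        # Other points of the same sample: keep only negatives that are
        # not near-duplicates of the anchor (assumed redundancy filter).
        for j, target in enumerate(targets):
            if j == i:
                continue
            sim = cosine(anchor, target)
            if sim < neg_max_sim:
                negatives.append(sim)

        # Points from other instances: very similar ones become
        # supplementary positives, clearly dissimilar ones negatives,
        # and anything in between is ignored.
        for feature in bank:
            sim = cosine(anchor, feature)
            if sim >= pos_min_sim:
                positives.append(sim)
            elif sim < neg_max_sim:
                negatives.append(sim)

        neg_sum = sum(math.exp(s / tau) for s in negatives)
        # Average the InfoNCE term over all positives of this anchor.
        per_pos = [
            -math.log(math.exp(s / tau) / (math.exp(s / tau) + neg_sum))
            for s in positives
        ]
        losses.append(sum(per_pos) / len(per_pos))
    return sum(losses) / len(losses)
```

Under these assumptions, a near-duplicate point that would otherwise be pushed away from the anchor is simply excluded from the denominator, so the loss no longer penalizes the model for keeping redundant points close together. Setting `neg_max_sim` above 1 disables the filter and recovers a plain point-level InfoNCE term for comparison.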