As a pioneering work, PointContrast performs unsupervised 3D representation learning by applying contrastive learning to raw RGB-D frames and demonstrates its effectiveness on various downstream tasks. However, large-scale unsupervised learning in 3D has yet to emerge because of two stumbling blocks: the inefficiency of matching RGB-D frames as contrastive views and the mode collapse phenomenon reported in previous works. Turning these two stumbling blocks into empirical stepping stones, we first propose an efficient and effective contrastive learning framework that generates contrastive views directly on scene-level point clouds through a well-curated data augmentation pipeline and a practical view mixing strategy. Second, we introduce reconstructive learning into the contrastive learning framework through a design of contrastive cross masks, which targets the reconstruction of point color and surfel normal. The resulting Masked Scene Contrast (MSC) framework extracts comprehensive 3D representations more efficiently and effectively. It accelerates pre-training by at least 3x without compromising performance compared with previous work. MSC also enables large-scale 3D pre-training across multiple datasets, which further boosts performance and achieves state-of-the-art fine-tuning results on several downstream tasks, e.g., 75.5% mIoU on the ScanNet semantic segmentation validation set.
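To make the idea of contrastive cross masks more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how the masks are built, so the sketch assumes that points are grouped into cubic grid cells and that the two views receive disjoint sets of masked cells, so a region hidden in one view stays visible in the other. The function name `cross_masks` and the parameters `cell_size`, `mask_ratio`, and `seed` are hypothetical.

```python
import numpy as np


def cross_masks(coords, cell_size=0.1, mask_ratio=0.5, seed=0):
    """Return two boolean point masks that never hide the same grid cell.

    Assumption (not from the abstract): masking is done per cubic grid cell,
    and the two views get disjoint sets of masked cells.
    """
    rng = np.random.default_rng(seed)

    # Assign every point to a cubic grid cell of side `cell_size`.
    cells = np.floor(coords / cell_size).astype(np.int64)
    _, cell_id = np.unique(cells, axis=0, return_inverse=True)
    cell_id = cell_id.reshape(-1)  # flat index of the cell each point lies in
    n_cells = int(cell_id.max()) + 1

    # Shuffle the occupied cells and hand disjoint slices to the two views.
    # The ratio is capped at 0.5 so the two slices cannot overlap.
    order = rng.permutation(n_cells)
    k = int(n_cells * min(mask_ratio, 0.5))
    masked_a = np.zeros(n_cells, dtype=bool)
    masked_b = np.zeros(n_cells, dtype=bool)
    masked_a[order[:k]] = True
    masked_b[order[k:2 * k]] = True

    # Broadcast the per-cell decision back to individual points.
    return masked_a[cell_id], masked_b[cell_id]
```

Under these assumptions, a reconstruction objective would ask the network to predict the color and surfel normal of the points where a view's mask is `True`, while the contrastive objective would compare features of the two views at matched points.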