Reducing the number of annotations required for supervised training is vital when labels are scarce and costly. This is particularly important for semantic segmentation on 3D datasets, which are often significantly smaller and harder to annotate than their image-based counterparts. Self-supervised pre-training on unlabelled data is one way to reduce the amount of manual annotation needed. Previous work has focused on pre-training with point clouds exclusively. While useful, this approach often requires two or more registered views of a scene. In the present work, we combine the image and point cloud modalities by first learning self-supervised image features and then using these features to train a 3D model. Because image data is included in many 3D datasets, our pre-training method requires only a single scan of a scene and can be applied where localization information is unavailable. We demonstrate that our pre-training approach, despite using single scans, achieves performance comparable to that of multi-scan, point cloud-only methods.
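To make the single-scan pairing of points and image features concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes a pinhole camera with known intrinsics `K`, points already expressed in the camera frame, a precomputed self-supervised image feature map, and an InfoNCE-style contrastive objective; the abstract does not specify the loss, so that choice and all function names here are illustrative assumptions.

```python
import numpy as np

def project_points(points_cam, K):
    """Project N x 3 camera-frame points to pixel coordinates (pinhole model).

    Assumes the last row of K is [0, 0, 1]. Returns N x 2 pixel coordinates
    and a mask of points lying in front of the camera.
    """
    z = points_cam[:, 2]
    in_front = z > 1e-6
    z_safe = np.where(in_front, z, 1.0)  # avoid division by zero or negative depth
    uvw = points_cam @ K.T
    uv = uvw[:, :2] / z_safe[:, None]
    return uv, in_front

def sample_pixel_features(feat_map, uv, in_front):
    """Look up the H x W x C image feature under each projected point.

    Returns the features of the valid points and the validity mask.
    """
    h, w, _ = feat_map.shape
    cols = np.floor(uv[:, 0]).astype(int)
    rows = np.floor(uv[:, 1]).astype(int)
    valid = in_front & (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    return feat_map[rows[valid], cols[valid]], valid

def info_nce(point_feats, image_feats, temperature=0.07):
    """InfoNCE loss: row i of point_feats should match row i of image_feats."""
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    q = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    logits = p @ q.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

In such a setup, the output of the 3D network at the valid points would be passed as `point_feats` and the sampled image features as `image_feats`. Only one scan and its accompanying image are involved, so no registration between scans is needed.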