State-of-the-art 3D object detectors are often trained on massive labeled datasets. However, annotating 3D bounding boxes remains prohibitively expensive and time-consuming, particularly for LiDAR. In response, recent works demonstrate that self-supervised pre-training with unlabeled data can improve detection accuracy with limited labels. Contemporary methods adapt best practices for self-supervised learning from the image domain to point clouds (such as contrastive learning). However, publicly available 3D datasets are considerably smaller and less diverse than those used for image-based self-supervised learning, limiting their effectiveness. We do note, however, that such 3D data is naturally collected in a multimodal fashion, often paired with images. Rather than pre-training with only self-supervised objectives, we argue that it is better to bootstrap point cloud representations using image-based foundation models trained on internet-scale data. Specifically, we propose a shelf-supervised approach (i.e., supervised with off-the-shelf image foundation models) for generating zero-shot 3D bounding boxes from paired RGB and LiDAR data. Pre-training 3D detectors with such pseudo-labels yields significantly better semi-supervised detection accuracy than prior self-supervised pretext tasks. Importantly, we show that image-based shelf-supervision is helpful for training LiDAR-only, RGB-only, and multimodal (RGB + LiDAR) detectors. We demonstrate the effectiveness of our approach on nuScenes and the Waymo Open Dataset (WOD), significantly improving over prior work in limited-data settings. Our code is available at https://github.com/meharkhurana03/cm3d
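To make the shelf-supervised idea concrete, the sketch below illustrates one common way such zero-shot 3D pseudo-labels can be produced: take a 2D box predicted by an off-the-shelf image foundation model, crop the LiDAR points whose projections fall inside it (a frustum crop), reject background by depth, and fit a coarse 3D box. This is a minimal illustration under assumed inputs (the function name `lift_box_to_3d`, calibration matrices `K` and `T_cam_from_lidar`, and the 2-meter depth threshold are hypothetical choices for this sketch, not the paper's exact method).

```python
import numpy as np

def lift_box_to_3d(points_lidar, box_2d, K, T_cam_from_lidar):
    """Hypothetical sketch: lift a 2D detection (e.g., from an off-the-shelf
    image foundation model) to a coarse 3D pseudo-box using LiDAR points.

    points_lidar: (N, 3) points in the LiDAR frame
    box_2d: (x1, y1, x2, y2) pixel coordinates of the 2D detection
    K: (3, 3) camera intrinsics
    T_cam_from_lidar: (4, 4) LiDAR-to-camera extrinsics
    """
    # Transform LiDAR points into the camera frame (homogeneous coords).
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    # Keep points in front of the camera and project them to pixels.
    front = pts_cam[:, 2] > 0.1
    uv = (K @ pts_cam[front].T).T
    uv = uv[:, :2] / uv[:, 2:3]

    # Select points whose projection falls inside the 2D box (frustum crop).
    x1, y1, x2, y2 = box_2d
    in_box = (uv[:, 0] >= x1) & (uv[:, 0] <= x2) & \
             (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
    frustum = points_lidar[front][in_box]
    if len(frustum) == 0:
        return None

    # Crude background rejection: keep points near the median range
    # (2.0 m is an assumed threshold for this sketch).
    d = np.linalg.norm(frustum, axis=1)
    obj = frustum[np.abs(d - np.median(d)) < 2.0]

    # Fit an axis-aligned 3D box (center, size) as the pseudo-label.
    lo, hi = obj.min(axis=0), obj.max(axis=0)
    return {"center": (lo + hi) / 2.0, "size": hi - lo}
```

In practice, such pseudo-boxes would carry the class label predicted by the image model and then serve as noisy supervision for pre-training a 3D detector before fine-tuning on the limited labeled set.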