Unsupervised 3D object detection aims to accurately detect objects in unstructured environments without explicit supervisory signals. Given sparse LiDAR point clouds, this task often yields compromised performance on distant or small objects due to the inherent sparsity and limited spatial resolution of the data. In this paper, we make one of the early attempts to integrate LiDAR data with 2D images for unsupervised 3D detection and introduce a new method, dubbed LiDAR-2D Self-paced Learning (LiSe). We argue that RGB images serve as a valuable complement to LiDAR data, offering precise 2D localization cues, particularly when only scarce LiDAR points are available for certain objects. Considering the unique characteristics of both modalities, our framework devises a self-paced learning pipeline that incorporates adaptive sampling and weak model aggregation strategies. The adaptive sampling strategy dynamically tunes the distribution of pseudo labels during training, countering the tendency of models to overfit to easily detected samples, such as nearby and large-sized objects; by doing so, it ensures a balanced learning trajectory across varying object scales and distances. The weak model aggregation component consolidates the strengths of models trained under different pseudo label distributions, culminating in a robust and powerful final model. Experimental evaluations validate the efficacy of the proposed LiSe method, showing significant improvements of +7.1% AP$_{BEV}$ and +3.4% AP$_{3D}$ on nuScenes, and +8.3% AP$_{BEV}$ and +7.4% AP$_{3D}$ on Lyft, over existing techniques.
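To make the two strategies concrete, the following is a minimal conceptual sketch, not the paper's actual implementation: it assumes pseudo labels carry a `distance` field and detections carry box parameters with a `score`. Adaptive sampling re-weights pseudo labels inversely to how common their distance bin is, so rare distant objects are drawn more often; aggregation fuses candidate boxes from weak models by confidence-weighted averaging (the paper's exact aggregation rule may differ).

```python
import random
from collections import Counter

def adaptive_sample(pseudo_labels, bin_width=10.0, rng=None):
    """Re-draw pseudo labels with weights inverse to their distance-bin
    frequency, up-weighting under-represented (e.g. far-away) objects.
    Each pseudo label is a dict with a 'distance' key in meters."""
    rng = rng or random.Random(0)
    bins = [int(p["distance"] // bin_width) for p in pseudo_labels]
    counts = Counter(bins)
    weights = [1.0 / counts[b] for b in bins]
    # Same number of labels, drawn with replacement under the new weights.
    return rng.choices(pseudo_labels, weights=weights, k=len(pseudo_labels))

def aggregate_boxes(candidate_boxes):
    """Fuse overlapping candidate boxes from several weak models by
    confidence-weighted averaging of their parameters (a simple stand-in
    for the weak model aggregation component)."""
    total = sum(b["score"] for b in candidate_boxes)
    keys = ("x", "y", "z", "l", "w", "h")
    return {k: sum(b[k] * b["score"] for b in candidate_boxes) / total
            for k in keys}
```

In this sketch, a scene with nine nearby labels and one distant label would give the distant one roughly half the total sampling weight instead of one tenth, which is the balancing effect the abstract describes.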