We propose MonoLiG, a novel semi-supervised active learning (SSAL) framework for monocular 3D object detection with LiDAR guidance, which leverages all modalities of the data collected during model development. We use LiDAR to guide both the data selection and the training of monocular 3D detectors without introducing any overhead in the inference phase. During training, we adopt the cross-modal LiDAR-teacher, monocular-student framework from semi-supervised learning to distill information from unlabeled data as pseudo-labels. To handle the differences in sensor characteristics, we propose a data noise-based weighting mechanism that reduces the effect of noise propagating from the LiDAR modality to the monocular one. To select which samples to label in order to improve model performance, we propose a sensor consistency-based selection score that is also coherent with the training objective. Extensive experimental results on the KITTI and Waymo datasets verify the effectiveness of our proposed framework. In particular, our selection strategy consistently outperforms state-of-the-art active learning baselines, yielding up to a 17% better saving rate in labeling costs. Our training strategy attains the top place in the official KITTI 3D and bird's-eye-view (BEV) monocular object detection benchmarks, improving the BEV Average Precision (AP) by 2.02.