In this paper, we tackle the challenging problem of monocular 3D object detection with a general semi-supervised framework. Having observed that the bottleneck of this task is the lack of reliable and informative samples for training the detector, we introduce a novel, simple, yet effective `Augment and Criticize' framework that mines abundant informative samples from unlabeled data to learn more robust detection models. In the `Augment' stage, we present Augmentation-based Prediction aGgregation (APG), which aggregates detections from various automatically learned augmented views to improve the robustness of pseudo-label generation. Since not all pseudo labels from APG are informative, we follow it with the `Criticize' stage. In particular, we introduce the Critical Retraining Strategy (CRS), which, instead of simply filtering pseudo labels with a fixed threshold (e.g., on the classification score) as in 2D semi-supervised tasks, leverages a learnable network to evaluate the contribution of unlabeled images at different training timestamps. In this way, noisy samples that hinder model evolution can be effectively suppressed. To validate our framework, we apply it to MonoDLE and MonoFlex. The two resulting detectors, dubbed 3DSeMo_DLE and 3DSeMo_FLEX, achieve state-of-the-art results on KITTI with improvements of over 3.5% AP_3D/BEV (Easy), demonstrating the framework's effectiveness and generality. Code and models will be released.
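To make the `Augment' idea concrete, the following is a minimal sketch of how detections from several augmented views of one unlabeled image might be aggregated into pseudo labels. It is an illustration under stated assumptions, not the APG procedure itself: the abstract does not specify the aggregation rule, so the greedy center-distance grouping, the score-weighted averaging, and the names `aggregate_views`, `dist_thresh`, and `min_votes` are all hypothetical. The sketch also assumes that each view's boxes have already been mapped back to the original camera frame, that scores are positive, and it does not model how the augmentations are learned.

```python
import numpy as np


def aggregate_views(view_boxes, view_scores, dist_thresh=1.0, min_votes=2):
    """Merge per-view 3D detections into pseudo labels (hypothetical rule).

    view_boxes:  list of (N_i, 7) arrays, rows are [x, y, z, l, w, h, yaw],
                 assumed to be expressed in the original camera frame.
    view_scores: list of (N_i,) arrays of positive confidence scores.
    Returns a list of (box, mean_score, votes) tuples.
    """
    boxes = np.concatenate(view_boxes, axis=0)
    scores = np.concatenate(view_scores, axis=0)
    # Remember which augmented view produced each detection.
    view_ids = np.concatenate(
        [np.full(len(b), i) for i, b in enumerate(view_boxes)]
    )

    used = np.zeros(len(boxes), dtype=bool)
    pseudo_labels = []
    # Visit detections from most to least confident.
    for idx in np.argsort(-scores):
        if used[idx]:
            continue
        # Group all unused detections whose 3D center is close to this one.
        dist = np.linalg.norm(boxes[:, :3] - boxes[idx, :3], axis=1)
        members = (~used) & (dist < dist_thresh)
        used |= members

        # Keep the group only if enough distinct views agree on it.
        votes = len(np.unique(view_ids[members]))
        if votes < min_votes:
            continue

        # Score-weighted average of location and size; the yaw is taken from
        # the most confident member to avoid averaging angles.
        weights = scores[members] / scores[members].sum()
        merged = boxes[idx].copy()
        merged[:6] = (weights[:, None] * boxes[members, :6]).sum(axis=0)
        pseudo_labels.append((merged, float(scores[members].mean()), votes))
    return pseudo_labels
```

In this sketch, a detection that appears in only one augmented view is discarded, while an object detected consistently across views yields a single averaged box. A fixed `min_votes` is itself a hand-set filter; the `Criticize' stage described above replaces this kind of fixed rule with a learned evaluation of each unlabeled sample.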