With the advent of the era of large models, the demand for data has grown considerably. In monocular 3D object detection in particular, expensive manual annotation can limit further progress. Existing works have investigated weakly supervised algorithms that rely on the LiDAR modality to generate 3D pseudo labels, and therefore cannot be applied to ordinary videos. In this paper, we propose a novel paradigm, termed BA$^2$-Det, which leverages the idea of global-to-local 3D reconstruction for 2D-supervised monocular 3D object detection. Specifically, we recover 3D structure from monocular videos by scene-level global reconstruction with global bundle adjustment (BA) and obtain object clusters with the DoubleClustering algorithm. Learning from objects that are completely reconstructed in global BA, the GBA-Learner predicts pseudo labels for occluded objects. Finally, we train an LBA-Learner with object-centric local BA to generalize the generated 3D pseudo labels to moving objects. Experiments on the large-scale Waymo Open Dataset show that the performance of BA$^2$-Det is on par with the fully supervised BA-Det trained with 10% of the videos, and that it even outperforms some pioneering fully supervised methods. We also show the great potential of BA$^2$-Det for detecting open-set 3D objects in complex scenes. The code will be made available. Project page: https://ba2det.site
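For readers unfamiliar with bundle adjustment, the generic textbook objective that a global BA step minimizes can be sketched as follows. This is the standard reprojection-error formulation and not necessarily the exact loss used in BA$^2$-Det; the symbols $T_i$, $P_j$, $p_{ij}$, $\pi$, $K$, $\rho$ and the visibility set $\mathcal{V}$ are notation assumed here for illustration only.

$$
\min_{\{T_i\},\,\{P_j\}} \; \sum_{(i,j)\in\mathcal{V}} \rho\!\left( \left\| \pi\!\left(K,\, T_i,\, P_j\right) - p_{ij} \right\|_2^2 \right)
$$

Here $T_i$ is the camera pose of frame $i$, $P_j$ is a 3D point, $p_{ij}$ is the 2D keypoint observed for $P_j$ in frame $i$, $\pi$ is the pinhole projection with intrinsics $K$, $\rho$ is an optional robust loss, and $\mathcal{V}$ is the set of (frame, point) pairs in which the point is visible. Under the same assumed notation, an object-centric local BA would plausibly restrict the sum to points belonging to a single object.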