Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant performance gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects at different depths can yield the same 2D bounding box and similar visual features in the image. The network cannot accurately distinguish different depths from such non-discriminative visual features, which makes depth training unstable. To facilitate depth learning, we propose a simple yet effective plug-and-play module, \underline{O}ne \underline{B}ounding Box \underline{M}ultiple \underline{O}bjects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To keep the pseudo-3D labels reasonable, we design two label scoring strategies that represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, which boosts training stability and thus improves final performance. Extensive experiments on the KITTI and Waymo benchmarks show that our method improves state-of-the-art monocular 3D detectors by a significant margin (under the moderate setting on the KITTI validation set, the improvements are $\mathbf{1.82\sim 10.91\%}$ \textbf{mAP in BEV} and $\mathbf{1.18\sim 9.36\%}$ \textbf{mAP in 3D}). Code is available at \url{https://github.com/mrsempress/OBMO}.
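To make the idea of shifting a 3D box along the viewing frustum concrete, the following is a minimal sketch and not the released implementation. It assumes a pinhole camera with the box center given in camera coordinates, and it scales the center along the ray through the camera origin so that the projected center pixel stays fixed. The function names `shift_along_ray` and `linear_quality`, the offset values, and the linearly decaying score are illustrative assumptions; the abstract does not specify how the two scoring strategies are defined.

```python
import numpy as np


def shift_along_ray(center, depth_offsets):
    """Shift a 3D box center along the camera ray (hypothetical helper).

    center: (x, y, z) in camera coordinates, with z > 0 the depth.
    depth_offsets: 1D array of depth changes in the same unit as z.
    Returns an (N, 3) array of shifted centers. Because x/z and y/z are
    unchanged, the projected center pixel is the same for every row.
    """
    center = np.asarray(center, dtype=float)
    offsets = np.asarray(depth_offsets, dtype=float)
    scale = (center[2] + offsets) / center[2]
    return center[None, :] * scale[:, None]


def linear_quality(depth_offsets, max_offset):
    """Assumed scoring rule: 1 at zero offset, decaying linearly to 0."""
    offsets = np.abs(np.asarray(depth_offsets, dtype=float))
    return np.clip(1.0 - offsets / max_offset, 0.0, 1.0)


if __name__ == "__main__":
    gt_center = np.array([2.0, 1.0, 20.0])   # illustrative ground-truth center
    offsets = np.array([-1.0, 0.0, 1.0])     # illustrative depth shifts
    pseudo_centers = shift_along_ray(gt_center, offsets)
    scores = linear_quality(offsets, max_offset=2.0)
    for c, s in zip(pseudo_centers, scores):
        print(c, s)
```

Only the center projection is preserved exactly in this sketch. The projected extent of a box with fixed dimensions changes slightly with depth, which is one reason soft quality scores are useful for down-weighting pseudo labels that lie far from the original label.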