Autonomous vehicles (AVs) must accurately detect objects from both common and rare classes for safe navigation, motivating the problem of Long-Tailed 3D Object Detection (LT3D). Contemporary LiDAR-based 3D detectors perform poorly on rare classes (e.g., CenterPoint achieves only 5.1 AP on stroller) because it is difficult to recognize objects from sparse LiDAR points alone. RGB images provide visual evidence that helps resolve such ambiguities, motivating the study of RGB-LiDAR fusion. In this paper, we delve into a simple late-fusion framework that ensembles independently trained RGB and LiDAR detectors. Unlike recent end-to-end methods, which require paired multi-modal training data, our late-fusion approach can easily leverage large-scale uni-modal datasets, significantly improving rare-class detection. In particular, we examine three critical components of this late-fusion framework from first principles: whether to train 2D or 3D RGB detectors, whether to match RGB and LiDAR detections in 3D or on the projected 2D image plane, and how to fuse matched detections. Extensive experiments reveal that 2D RGB detectors achieve better recognition accuracy than 3D RGB detectors, that matching on the 2D image plane mitigates depth estimation errors, and that fusing scores probabilistically with calibration leads to state-of-the-art LT3D performance. Our late-fusion approach achieves 51.4 mAP on the established nuScenes LT3D benchmark, improving over prior work by 5.9 mAP.
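To make the three components concrete, the following is a minimal sketch of one way such a late-fusion step could look. It is not the implementation evaluated in this paper. It assumes that LiDAR detections have already been projected onto the image plane as axis-aligned 2D boxes, that calibration is a per-detector temperature applied to the score logit, and that fusion is a naive Bayes product under an independence assumption with a uniform prior. All names (`Det2D`, `calibrate`, `fuse`, `late_fuse`), the IoU threshold, and the rule for label disagreement are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from math import exp, log


@dataclass
class Det2D:
    box: tuple    # (x1, y1, x2, y2) in pixels; LiDAR boxes assumed pre-projected
    label: str
    score: float  # detector confidence in (0, 1)


def iou(a, b):
    """Intersection-over-union of two axis-aligned 2D boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def calibrate(score, temperature):
    """Temperature-scale a score in logit space (assumed calibration scheme)."""
    s = min(max(score, 1e-6), 1 - 1e-6)
    return 1.0 / (1.0 + exp(-log(s / (1 - s)) / temperature))


def fuse(p_lidar, p_rgb):
    """Naive Bayes fusion of two scores, assuming independence and a uniform prior."""
    pos = p_lidar * p_rgb
    return pos / (pos + (1 - p_lidar) * (1 - p_rgb))


def late_fuse(lidar_dets, rgb_dets, iou_thresh=0.5, t_lidar=1.0, t_rgb=1.0):
    """Greedily match each LiDAR detection to an unused RGB detection in 2D."""
    fused, used = [], set()
    for ld in lidar_dets:
        best_j, best_iou = None, iou_thresh
        for j, rd in enumerate(rgb_dets):
            overlap = iou(ld.box, rd.box)
            if j not in used and overlap >= best_iou:
                best_j, best_iou = j, overlap
        p_l = calibrate(ld.score, t_lidar)
        if best_j is None:
            # Unmatched LiDAR detection: keep it with its calibrated score.
            fused.append(Det2D(ld.box, ld.label, p_l))
            continue
        used.add(best_j)
        rd = rgb_dets[best_j]
        p_r = calibrate(rd.score, t_rgb)
        if rd.label == ld.label:
            # Labels agree: combine the evidence from both modalities.
            fused.append(Det2D(ld.box, ld.label, fuse(p_l, p_r)))
        elif p_r > p_l:
            # Labels disagree: illustrative rule, trust the more confident detector.
            fused.append(Det2D(ld.box, rd.label, p_r))
        else:
            fused.append(Det2D(ld.box, ld.label, p_l))
    return fused
```

For example, under these assumptions a LiDAR detection and an RGB detection that overlap in the image, agree on the label `stroller`, and score 0.6 and 0.7 fuse to a score of about 0.78, higher than either detector alone. In practice the temperatures would be fit on held-out data for each detector separately.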