Autonomous vehicles (AVs) must accurately detect objects from both common and rare classes for safe navigation, motivating the problem of Long-Tailed 3D Object Detection (LT3D). Contemporary LiDAR-based 3D detectors perform poorly on rare classes (e.g., CenterPoint achieves only 5.1 AP on stroller) because it is difficult to recognize objects from sparse LiDAR points alone. RGB images provide visual evidence to help resolve such ambiguities, motivating the study of RGB-LiDAR fusion. In this paper, we delve into a simple late-fusion framework that ensembles independently trained RGB and LiDAR detectors. Unlike recent end-to-end methods, which require paired multi-modal training data, our late-fusion approach can easily leverage large-scale uni-modal datasets, significantly improving rare class detection. In particular, we examine three critical components in this late-fusion framework from first principles: whether to train 2D or 3D RGB detectors, whether to match RGB and LiDAR detections in 3D or on the projected 2D image plane, and how to fuse matched detections. Extensive experiments reveal that 2D RGB detectors achieve better recognition accuracy than 3D RGB detectors, that matching on the 2D image plane mitigates depth estimation errors, and that fusing scores probabilistically with calibration leads to state-of-the-art LT3D performance. Our late-fusion approach achieves 51.4 mAP on the established nuScenes LT3D benchmark, improving over prior work by 5.9 mAP.
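The late-fusion steps named above (matching on the 2D image plane, then fusing calibrated scores probabilistically) can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes that LiDAR detections have already been projected to axis-aligned 2D boxes, that calibration is done by temperature scaling, and that fusion follows a naive-Bayes rule with a shared class prior. All function names, dictionary keys, thresholds, and default values are hypothetical, and a matched detection simply keeps its LiDAR fields with an updated score.

```python
from math import exp, log


def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def calibrate(score, temperature):
    """Temperature-scale a confidence in (0, 1) through its logit (assumed scheme)."""
    eps = 1e-6
    s = min(max(score, eps), 1.0 - eps)
    logit = log(s / (1.0 - s))
    return 1.0 / (1.0 + exp(-logit / temperature))


def fuse_scores(p_lidar, p_rgb, prior):
    """Naive-Bayes fusion of two posteriors, assuming conditional independence
    of the two modalities and a shared class prior."""
    pos = p_lidar * p_rgb / prior
    neg = (1.0 - p_lidar) * (1.0 - p_rgb) / (1.0 - prior)
    return pos / (pos + neg)


def late_fuse(lidar_dets, rgb_dets, iou_thresh=0.5,
              t_lidar=1.0, t_rgb=1.0, prior=0.5):
    """Greedily match projected LiDAR boxes to RGB boxes by 2D IoU, then fuse scores.

    Each detection is a dict with hypothetical keys 'box2d' and 'score'.
    Unmatched LiDAR detections keep their calibrated score.
    """
    fused, used = [], set()
    for det in sorted(lidar_dets, key=lambda d: -d["score"]):
        best_j, best_iou = None, iou_thresh
        for j, rgb in enumerate(rgb_dets):
            if j in used:
                continue
            overlap = iou_2d(det["box2d"], rgb["box2d"])
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        p_l = calibrate(det["score"], t_lidar)
        if best_j is None:
            fused.append({**det, "score": p_l})
        else:
            used.add(best_j)
            p_r = calibrate(rgb_dets[best_j]["score"], t_rgb)
            fused.append({**det, "score": fuse_scores(p_l, p_r, prior)})
    return fused
```

Under these assumptions, two matched detections that each score above the prior reinforce each other, and a smaller prior (as would be plausible for a rare class) makes that agreement count for more. Because matching uses only 2D overlap, the RGB detector never needs to estimate depth.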