Combining complementary sensor modalities is crucial to providing robust perception for safety-critical robotics applications such as autonomous driving (AD). Recent state-of-the-art camera-lidar fusion methods for AD rely on monocular depth estimation, which is a notoriously difficult task compared to using depth information from the lidar directly. Here, we find that this approach does not leverage depth as expected. We show that naively improving depth estimation does not improve object detection performance and that, strikingly, removing depth estimation altogether does not degrade it. This suggests that relying on monocular depth could be an unnecessary architectural bottleneck during camera-lidar fusion. In this work, we introduce a novel fusion method that bypasses monocular depth estimation altogether and instead selects and fuses camera and lidar features in a bird's-eye-view grid using a simple attention mechanism. We show that our model can modulate its use of camera features based on the availability of lidar features and that it yields better 3D object detection on the nuScenes dataset than baselines relying on monocular depth estimation.
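To make the idea of attention-based fusion in a bird's-eye-view (BEV) grid concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that lidar features already live on a BEV grid and that camera features have been flattened into a set of candidate vectors; each BEV cell forms a query from its lidar feature and attends over the camera candidates, with no depth estimate involved. The function name `fuse_bev`, the projection matrices `w_q`, `w_k`, `w_v`, the concatenation step, and all shapes are hypothetical choices for illustration. A real model would presumably learn the projections and restrict each cell's candidates using camera geometry.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_bev(lidar_bev, cam_feats, w_q, w_k, w_v):
    """Fuse camera features into a lidar BEV grid with dot-product attention.

    lidar_bev: (H, W, C) lidar features on the BEV grid (assumed given).
    cam_feats: (N, C) flattened camera features (assumed given).
    w_q, w_k, w_v: (C, D) projection matrices (hypothetical; learned in practice).
    Returns the fused grid (H, W, C + D) and attention weights (H, W, N).
    """
    H, W, C = lidar_bev.shape
    q = lidar_bev.reshape(H * W, C) @ w_q      # one query per BEV cell: (H*W, D)
    k = cam_feats @ w_k                        # camera keys:            (N, D)
    v = cam_feats @ w_v                        # camera values:          (N, D)

    # Scaled dot-product scores between every BEV cell and every camera feature.
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (H*W, N)
    attn = softmax(scores, axis=-1)            # each row sums to 1

    # Attention-weighted camera features, placed back on the BEV grid.
    cam_bev = (attn @ v).reshape(H, W, -1)     # (H, W, D)

    # Concatenation is one simple fusion choice (an assumption of this sketch).
    fused = np.concatenate([lidar_bev, cam_bev], axis=-1)
    return fused, attn.reshape(H, W, -1)


# Small usage example with random inputs.
rng = np.random.default_rng(0)
H, W, C, D, N = 4, 4, 8, 8, 20
lidar_bev = rng.normal(size=(H, W, C))
cam_feats = rng.normal(size=(N, C))
w_q, w_k, w_v = (rng.normal(size=(C, D)) for _ in range(3))
fused, attn = fuse_bev(lidar_bev, cam_feats, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

Because the attention weights are computed from the lidar-derived queries, the amount of camera information each cell receives depends on the lidar features in that cell. This is only a loose illustration of how such a model could modulate its use of camera features; the sketch does not reproduce that reported behavior.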