Low-cost, vision-centric 3D perception systems for autonomous driving have made significant progress in recent years, narrowing the gap to expensive LiDAR-based methods. The main obstacle to becoming a fully reliable alternative is robust depth prediction, as camera-based systems struggle with long detection ranges and with adverse lighting and weather conditions. In this work, we introduce HyDRa, a novel camera-radar fusion architecture for diverse 3D perception tasks. Building on the principles of dense BEV (Bird's Eye View)-based architectures, HyDRa introduces a hybrid fusion approach that combines the strengths of complementary camera and radar features in two distinct representation spaces. Our Height Association Transformer module leverages radar features already in the perspective view to produce more robust and accurate depth predictions. In the BEV, we refine the initial sparse representation with a Radar-weighted Depth Consistency. HyDRa sets a new state of the art for camera-radar fusion on the public nuScenes dataset with 64.2 NDS (+1.8) and 58.4 AMOTA (+1.5). Moreover, our semantically rich and spatially accurate BEV features can be directly converted into a powerful occupancy representation, beating all previous camera-based methods on the Occ3D benchmark by 3.7 mIoU.