RCBEVDet: Radar-camera Fusion in Bird's Eye View for 3D Object Detection

Three-dimensional object detection is one of the key tasks in autonomous driving. To reduce costs in practice, low-cost multi-view cameras have been proposed as a replacement for expensive LiDAR sensors in 3D object detection. However, it is difficult to achieve highly accurate and robust 3D object detection with cameras alone. An effective solution is to combine multi-view cameras with economical millimeter-wave radar sensors to achieve more reliable multi-modal 3D object detection. In this paper, we introduce RCBEVDet, a radar-camera fusion 3D object detection method in the bird's eye view (BEV). Specifically, we first design RadarBEVNet for radar BEV feature extraction. RadarBEVNet consists of a dual-stream radar backbone and a Radar Cross-Section (RCS) aware BEV encoder. In the dual-stream radar backbone, a point-based encoder and a transformer-based encoder extract radar features, and an injection and extraction module facilitates communication between the two encoders. The RCS-aware BEV encoder uses RCS as an object size prior when scattering point features in BEV. In addition, we present the Cross-Attention Multi-layer Fusion module, which automatically aligns the multi-modal BEV features from radar and camera with a deformable attention mechanism and then fuses them with channel and spatial fusion layers. Experimental results show that RCBEVDet achieves new state-of-the-art radar-camera fusion results on the nuScenes and View-of-Delft (VoD) 3D object detection benchmarks. Furthermore, RCBEVDet achieves better 3D detection results than all real-time camera-only and radar-camera 3D object detectors while running at a faster inference speed of 21~28 FPS. The source code will be released at https://github.com/VDIGPKU/RCBEVDet.
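
To make the idea of using RCS as an object size prior more concrete, the following is a minimal NumPy sketch, not the RCBEVDet implementation. It assumes that each radar point is spread over a disk of BEV cells whose radius grows with the point's RCS value, with a Gaussian falloff from the centre cell. The function name `rcs_aware_scatter`, the parameters `radius_scale` and `max_radius`, and the Gaussian weighting are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import numpy as np


def rcs_aware_scatter(points_xy, feats, rcs, grid_size=8, cell=1.0,
                      radius_scale=1.0, max_radius=3):
    """Scatter per-point features into a BEV grid (illustrative only).

    points_xy: (N, 2) point positions in metres, origin at the grid corner.
    feats:     (N, C) per-point feature vectors.
    rcs:       (N,) non-negative RCS values (assumed already normalised).
    A point with larger RCS is spread over more BEV cells.
    """
    channels = feats.shape[1]
    bev = np.zeros((grid_size, grid_size, channels), dtype=np.float32)
    for p, f, r in zip(points_xy, feats, rcs):
        # Index of the BEV cell that contains the point.
        cx, cy = np.floor(p / cell).astype(int)
        # Assumed size prior: spread radius (in cells) proportional to RCS.
        rad = int(min(max_radius, round(float(radius_scale * r))))
        sigma = max(rad, 1)
        for ix in range(max(0, cx - rad), min(grid_size, cx + rad + 1)):
            for iy in range(max(0, cy - rad), min(grid_size, cy + rad + 1)):
                d = np.hypot(ix - cx, iy - cy)
                if d > rad:
                    continue  # keep the footprint roughly circular
                # Gaussian falloff with distance from the centre cell.
                w = np.exp(-(d ** 2) / (2.0 * sigma ** 2))
                bev[ix, iy] += w * f
    return bev


# Two points: one with zero RCS (single cell), one with RCS 2 (radius-2 disk).
points = np.array([[1.5, 1.5], [5.5, 5.5]])
feats = np.array([[1.0], [2.0]])
rcs = np.array([0.0, 2.0])
bev = rcs_aware_scatter(points, feats, rcs)
```

In this sketch the low-RCS point fills only the cell it falls into, while the high-RCS point covers a disk of neighbouring cells, which is one simple way a sparse radar return could be expanded into a denser BEV feature.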
