Autonomous driving requires an accurate and fast 3D perception system that covers 3D object detection, tracking, and segmentation. Although recent low-cost camera-based approaches have shown promising results, they degrade under poor illumination or bad weather and suffer from large localization errors. Fusing cameras with low-cost radar, which provides precise long-range measurements and operates reliably in all environments, is therefore promising but has not yet been thoroughly investigated. In this paper, we propose Camera Radar Net (CRN), a novel camera-radar fusion framework that generates a semantically rich and spatially accurate bird's-eye-view (BEV) feature map for various tasks. To overcome the lack of spatial information in an image, we transform perspective-view image features to BEV with the help of sparse but accurate radar points. We further aggregate the image and radar feature maps in BEV using multi-modal deformable attention, which is designed to handle the spatial misalignment between the two inputs. In its real-time setting, CRN runs at 20 FPS while achieving performance comparable to LiDAR detectors on nuScenes, and it even outperforms them at far distances in the 100 m setting. In its offline setting, CRN achieves 62.4% NDS and 57.5% mAP on the nuScenes test set, ranking first among all camera and camera-radar 3D object detectors.
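The radar-assisted view transformation mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `lift_with_radar`, the (depth, column) grid layout, and the choice to use radar occupancy as a multiplicative prior on an image-predicted depth distribution are all assumptions made for illustration. The idea shown is only that sparse radar returns can sharpen where along each camera ray the image features are placed before they are pooled into BEV.

```python
import numpy as np

def lift_with_radar(img_feat, depth_logits, radar_occ, eps=1e-6):
    """Lift per-column image features into a (depth, column) frustum grid.

    img_feat:     (C, W) image features, one vector per image column
    depth_logits: (D, W) unnormalised depth scores predicted from the image
    radar_occ:    (D, W) non-negative radar occupancy on the same grid
    Returns:      (C, D, W) frustum features
    """
    # Softmax over depth bins gives an image-only depth distribution.
    z = depth_logits - depth_logits.max(axis=0, keepdims=True)
    depth_prob = np.exp(z)
    depth_prob /= depth_prob.sum(axis=0, keepdims=True)

    # Assumption: radar acts as a multiplicative prior. The small eps means
    # a column with no radar return falls back to the image-only distribution.
    weight = depth_prob * (radar_occ + eps)
    weight /= weight.sum(axis=0, keepdims=True)

    # Spread each column's feature vector along depth according to its weight.
    return img_feat[:, None, :] * weight[None, :, :]

# Hypothetical toy input: 2 channels, 4 depth bins, 3 image columns.
img_feat = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])
depth_logits = np.zeros((4, 3))   # image alone is uncertain about depth
radar_occ = np.zeros((4, 3))
radar_occ[2, 1] = 1.0             # one radar return: depth bin 2, column 1

frustum = lift_with_radar(img_feat, depth_logits, radar_occ)
print(frustum.shape)
```

In this toy input, the column with a radar return places nearly all of its feature mass in the bin the radar indicates, while the other columns keep the uniform image-only distribution. A full pipeline would still need to resample this frustum grid into Cartesian BEV and fuse it with radar features, which this sketch does not attempt.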