Autonomous driving requires an accurate and fast 3D perception system that includes 3D object detection, tracking, and segmentation. Although recent low-cost camera-based approaches have shown promising results, they are susceptible to poor illumination and bad weather and suffer from large localization errors. Fusing cameras with low-cost radar, which provides precise long-range measurements and operates reliably in all environments, is therefore promising but has not yet been thoroughly investigated. In this paper, we propose Camera Radar Net (CRN), a novel camera-radar fusion framework that generates a semantically rich and spatially accurate bird's-eye-view (BEV) feature map for various tasks. To overcome the lack of spatial information in images, we transform perspective-view image features to BEV with the help of sparse but accurate radar points. We then aggregate the image and radar feature maps in BEV using multi-modal deformable attention, which is designed to tackle the spatial misalignment between the inputs. In its real-time setting, CRN runs at 20 FPS while achieving performance comparable to LiDAR detectors on nuScenes, and it even outperforms them at far distances in the 100 m setting. Moreover, in its offline setting, CRN achieves 62.4% NDS and 57.5% mAP on the nuScenes test set, ranking first among all camera and camera-radar 3D object detectors.