Despite radar's popularity in the automotive industry, most existing work on fusion-based 3D object detection focuses on LiDAR and camera fusion. In this paper, we propose TransCAR, a Transformer-based Camera-And-Radar fusion solution for 3D object detection. TransCAR consists of two modules. The first module learns 2D features from surround-view camera images and then uses a sparse set of 3D object queries to index into these 2D features. The vision-updated queries then interact with each other via a transformer self-attention layer. The second module learns radar features from multiple radar scans and then applies a transformer decoder to learn the interactions between the radar features and the vision-updated queries. The cross-attention layer within the transformer decoder adaptively learns a soft association between the radar features and the vision-updated queries, instead of a hard association based on sensor calibration only. Finally, our model estimates one bounding box per query and is trained with a set-to-set Hungarian loss, which removes the need for non-maximum suppression. TransCAR uses the radar scans to improve velocity estimation without temporal information. Experiments on the challenging nuScenes dataset show that TransCAR outperforms state-of-the-art camera-radar fusion-based 3D object detection approaches.
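The soft association described above can be illustrated with a minimal sketch of scaled dot-product cross-attention, in which each vision-updated query attends over all radar features and receives a weighted mix of them. This is not the TransCAR implementation: the function names (`cross_attention`, `softmax`, `dot`), the toy 4-dimensional features, and the plain residual update are assumptions made for illustration, and the learned projections, multi-head structure, and normalization layers of a real transformer decoder are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, radar_feats):
    """Single-head cross-attention without learned projections (assumption).

    queries:     list of query vectors (stand-ins for vision-updated queries)
    radar_feats: list of radar feature vectors, used as both keys and values
    Returns the updated queries and the attention weights, where
    weights[i][j] is the soft association of query i with radar feature j.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    updated, weights = [], []
    for q in queries:
        # Similarity of this query to every radar feature, then normalize.
        w = softmax([dot(q, r) * scale for r in radar_feats])
        # Weighted sum of radar features (the "soft" association).
        fused = [sum(w_j * r[k] for w_j, r in zip(w, radar_feats))
                 for k in range(d)]
        # Residual update: query plus aggregated radar information.
        updated.append([q_k + f_k for q_k, f_k in zip(q, fused)])
        weights.append(w)
    return updated, weights


# Toy data: two queries and three radar features in a 4-D feature space.
queries = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0]]
radar_feats = [[4.0, 0.0, 0.0, 0.0],
               [0.0, 4.0, 0.0, 0.0],
               [0.0, 0.0, 4.0, 0.0]]

updated, weights = cross_attention(queries, radar_feats)
for i, w in enumerate(weights):
    print(f"query {i} attention over radar features:", [round(x, 3) for x in w])
```

Each row of `weights` sums to one, so every query distributes its attention across all radar features rather than being tied to a single one. A hard association would instead assign each query to exactly one radar return, for example by projecting with the sensor calibration.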