In this work, we present SpaRC, a novel sparse fusion transformer for 3D perception that integrates multi-view image semantics with radar and camera point features. The fusion of radar and camera modalities has emerged as an efficient perception paradigm for autonomous driving systems. While conventional approaches rely on dense Bird's Eye View (BEV)-based architectures for depth estimation, contemporary query-based transformers excel in camera-only detection through their object-centric methodology. However, these query-based approaches suffer from false positive detections and imprecise localization due to implicit depth modeling. We address these challenges through three key contributions: (1) sparse frustum fusion (SFF) for cross-modal feature alignment, (2) range-adaptive radar aggregation (RAR) for precise object localization, and (3) local self-attention (LSA) for focused query aggregation. In contrast to existing methods that require computationally intensive BEV-grid rendering, SpaRC operates directly on encoded point features, yielding substantial improvements in both efficiency and accuracy. Empirical evaluations on the nuScenes and TruckScenes benchmarks demonstrate that SpaRC significantly outperforms existing dense BEV-based and sparse query-based detectors, achieving state-of-the-art performance of 67.1 NDS and 63.1 AMOTA. The code and pretrained models are available at https://github.com/phi-wol/sparc.
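To make the idea of "focused query aggregation" concrete, the sketch below shows a generic local self-attention over object queries, where each query attends only to its k nearest neighbors in 3D rather than to all queries. This is a toy illustration of the general technique named in the abstract, not SpaRC's actual LSA implementation; the function name, the k-nearest-neighbor criterion, and all shapes are our own assumptions.

```python
import numpy as np

def local_self_attention(feats, centers, k=4):
    """Toy local self-attention: each query attends only to its k nearest
    neighbors (including itself) in 3D space.

    NOTE: an illustrative sketch of spatially restricted attention in
    general, not the LSA module from the SpaRC paper.

    feats:   (N, D) query feature vectors
    centers: (N, 3) 3D reference points of the queries
    """
    n, d = feats.shape
    # Pairwise Euclidean distances between query reference points.
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    # Indices of the k closest queries for each query (self is included,
    # since the distance to itself is zero).
    nbr = np.argsort(dist, axis=1)[:, :k]
    out = np.empty_like(feats)
    for i in range(n):
        q, kv = feats[i], feats[nbr[i]]        # (D,), (k, D)
        logits = kv @ q / np.sqrt(d)           # scaled dot-product scores
        w = np.exp(logits - logits.max())      # numerically stable softmax
        w /= w.sum()
        out[i] = w @ kv                        # weighted sum over neighbors
    return out
```

Restricting each query's receptive field this way keeps the attention cost proportional to N·k instead of N², which matches the abstract's emphasis on avoiding dense, computationally intensive aggregation.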