Detecting and tracking moving objects is an essential component of environmental perception for autonomous driving. In the fast-growing field of camera-based multi-view 3D detectors, various transformer-based pipelines learn queries in 3D space from the 2D feature maps of perspective views, but the dominant dense bird's-eye-view (BEV) query mechanism is computationally inefficient. This paper proposes Sparse R-CNN 3D (SRCN3D), a novel two-stage, fully sparse detector that incorporates sparse queries, sparse attention with box-wise sampling, and sparse prediction. SRCN3D adopts a cascade structure that updates two tracks in parallel: a fixed number of query boxes and the latent query features. Our novel sparse feature sampling module uses only local 2D region-of-interest (RoI) features, obtained by projecting the 3D query boxes, for further box refinement, leading to a fully convolutional and deployment-friendly pipeline. For multi-object tracking, motion features, query features, and RoI features are jointly used in multi-hypothesis data association. Extensive experiments on the nuScenes dataset demonstrate that SRCN3D achieves competitive performance in both 3D object detection and multi-object tracking, while also exhibiting superior efficiency compared to transformer-based methods. Code and models are available at https://github.com/synsin0/SRCN3D.
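To make the box-wise sampling idea concrete, the following is a minimal geometric sketch of how a 3D query box could be projected into one camera view to obtain a 2D RoI. It is an illustration under stated assumptions, not the SRCN3D implementation: the function names `box_corners` and `project_box_to_roi`, the (center, size, yaw) box parameterization, and the single 3x4 projection matrix are all hypothetical, and the actual feature pooling over the RoI is omitted.

```python
import numpy as np


def box_corners(center, size, yaw):
    """Return the 8 corners, shape (8, 3), of a 3D box.

    Assumed parameterization (hypothetical): center (x, y, z),
    size (length, width, height), yaw rotation about the z axis.
    """
    signs = np.array(
        [[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
        dtype=float,
    )
    offsets = signs * np.asarray(size, dtype=float) / 2.0
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return offsets @ rot.T + np.asarray(center, dtype=float)


def project_box_to_roi(corners, proj, image_size):
    """Project 3D corners with a 3x4 matrix and return an axis-aligned RoI.

    Returns (x1, y1, x2, y2) clipped to the image, or None when the box is
    not visible. Corners behind the camera are simply dropped, which is a
    simplification chosen for this sketch.
    """
    homo = np.hstack([corners, np.ones((len(corners), 1))])
    cam = homo @ proj.T  # columns: u * depth, v * depth, depth
    depth = cam[:, 2]
    valid = depth > 1e-5
    if not valid.any():
        return None
    uv = cam[valid, :2] / depth[valid, None]
    width, height = image_size
    x1, y1 = np.clip(uv.min(axis=0), 0, [width, height])
    x2, y2 = np.clip(uv.max(axis=0), 0, [width, height])
    if x2 <= x1 or y2 <= y1:
        return None  # projected box falls entirely outside the image
    return float(x1), float(y1), float(x2), float(y2)
```

In a multi-view setting, a sketch like this would be evaluated once per camera, and the resulting RoIs would index into each view's 2D feature map; boxes that return `None` for a view contribute no features from that view.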