Camera-based 3D object detection in BEV (Bird's Eye View) space has drawn great attention over the past few years. Dense detectors typically follow a two-stage pipeline: they first construct a dense BEV feature and then perform object detection in BEV space, which entails complex view transformations and high computation cost. Sparse detectors, on the other hand, follow a query-based paradigm without explicit dense BEV feature construction, but perform worse than their dense counterparts. In this paper, we find that the key to mitigating this performance gap is the adaptability of the detector in both BEV and image space. To this end, we propose SparseBEV, a fully sparse 3D object detector that outperforms its dense counterparts. SparseBEV contains three key designs: (1) scale-adaptive self attention, which aggregates features with an adaptive receptive field in BEV space; (2) adaptive spatio-temporal sampling, which generates sampling locations under the guidance of queries; and (3) adaptive mixing, which decodes the sampled features with dynamic weights from the queries. On the test split of nuScenes, SparseBEV achieves state-of-the-art performance of 67.5 NDS. On the val split, SparseBEV achieves 55.8 NDS while maintaining a real-time inference speed of 23.5 FPS. Code is available at https://github.com/MCG-NJU/SparseBEV.
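To make the first design more concrete, the following is a minimal sketch of one way an adaptive receptive field in BEV space could be realized. The abstract does not give the formula, so everything here is an assumption for illustration: the function name `scale_adaptive_self_attention`, the projection vector `w_tau`, the choice of a distance penalty subtracted from the attention logits, and the single-head form without learned query/key/value projections. It is not taken from the released code.

```python
import numpy as np


def scale_adaptive_self_attention(queries, centers_xy, w_tau):
    """Single-head self attention with a query-dependent BEV distance penalty.

    queries:    (N, d) query features
    centers_xy: (N, 2) BEV-plane centers associated with each query
    w_tau:      (d,)   hypothetical projection producing a per-query scale
    """
    _, d = queries.shape

    # Pairwise Euclidean distance between query centers in the BEV plane.
    diff = centers_xy[:, None, :] - centers_xy[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)  # (N, N)

    # Per-query non-negative scale via a numerically stable softplus.
    # A larger tau penalizes distant queries more, shrinking the receptive field.
    tau = np.logaddexp(0.0, queries @ w_tau)  # (N,)

    # Scaled dot-product logits minus the (assumed) distance penalty.
    logits = queries @ queries.T / np.sqrt(d) - tau[:, None] * dist

    # Row-wise softmax, shifted by the row maximum for stability.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ queries, weights


rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
centers = np.array([[0.0, 0.0], [1000.0, 0.0], [0.0, 1000.0], [1000.0, 1000.0]])
out, attn = scale_adaptive_self_attention(q, centers, np.zeros(8))
```

Under these assumptions, a query whose scale is near zero attends globally as in ordinary self attention, while a query with a large scale attends almost only to its BEV neighbours. In the usage above the centers are far apart, so each query ends up attending essentially to itself.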