Bird's-eye-view (BEV) based methods have recently made great progress in multi-view 3D detection. Compared with BEV-based methods, sparse-based methods lag behind in performance but still have merits that should not be neglected. To push sparse 3D detection further, we introduce Sparse4D, a novel method that iteratively refines anchor boxes by sparsely sampling and fusing spatial-temporal features. It has two main components. (1) Sparse 4D Sampling: for each 3D anchor, we assign multiple 4D keypoints, which are then projected onto multi-view, multi-scale, multi-timestamp image features to sample the corresponding features. (2) Hierarchy Feature Fusion: we hierarchically fuse the sampled features across views and scales, across timestamps, and across keypoints to generate a high-quality instance feature. In this way, Sparse4D achieves 3D detection efficiently and effectively without relying on dense view transformation or global attention, and is friendlier to deployment on edge devices. Furthermore, we introduce an instance-level depth reweight module to alleviate the ill-posed nature of 3D-to-2D projection. In experiments, our method outperforms all sparse-based methods and most BEV-based methods on the detection task of the nuScenes dataset.
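The two steps described in the abstract, sparse 4D sampling and hierarchical fusion, can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: each camera is a plain 3x4 projection matrix, every timestamp uses a single feature scale, keypoints are moved to earlier timestamps with a constant-velocity model that ignores ego motion, and the learned fusion weights are replaced by uniform averages. All names (`project`, `bilinear`, `instance_feature`) are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Feature = List[float]
FeatureMap = List[List[Feature]]        # indexed [row][col][channel]
Camera = Sequence[Sequence[float]]      # assumed 3x4 projection matrix
Frame = Tuple[float, List[Tuple[Camera, FeatureMap]]]  # (seconds before now, views)


def project(point: Sequence[float], cam: Camera) -> Optional[Tuple[float, float]]:
    """Project a 3D point to pixel (u, v); None if it lies behind the camera."""
    hom = (*point, 1.0)
    u, v, depth = (sum(c * h for c, h in zip(row, hom)) for row in cam)
    if depth <= 1e-6:
        return None
    return u / depth, v / depth


def bilinear(fmap: FeatureMap, u: float, v: float) -> Optional[Feature]:
    """Bilinearly sample a feature vector at (u, v); None if outside the map."""
    h, w = len(fmap), len(fmap[0])
    if not (0.0 <= u <= w - 1 and 0.0 <= v <= h - 1):
        return None
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = u - x0, v - y0
    return [
        (1 - ay) * ((1 - ax) * a + ax * b) + ay * ((1 - ax) * c + ax * d)
        for a, b, c, d in zip(fmap[y0][x0], fmap[y0][x1], fmap[y1][x0], fmap[y1][x1])
    ]


def mean(features: List[Feature]) -> Optional[Feature]:
    """Channel-wise average; stands in for the learned fusion weights."""
    if not features:
        return None
    return [sum(channel) / len(features) for channel in zip(*features)]


def instance_feature(
    keypoints: Sequence[Sequence[float]],
    velocity: Sequence[float],
    frames: Sequence[Frame],
) -> Optional[Feature]:
    """Sample every keypoint in every view and timestamp, then fuse in three levels."""
    per_keypoint = []
    for kp in keypoints:
        per_time = []
        for dt, views in frames:
            # 4D keypoint: shift the 3D point back along the assumed velocity.
            moved = [p - vel * dt for p, vel in zip(kp, velocity)]
            sampled = []
            for cam, fmap in views:
                uv = project(moved, cam)
                feat = bilinear(fmap, *uv) if uv is not None else None
                if feat is not None:
                    sampled.append(feat)
            fused_views = mean(sampled)          # level 1: across views
            if fused_views is not None:
                per_time.append(fused_views)
        fused_time = mean(per_time)              # level 2: across timestamps
        if fused_time is not None:
            per_keypoint.append(fused_time)
    return mean(per_keypoint)                    # level 3: across keypoints
```

Because only the projected keypoints are read from the image features, the cost grows with the number of anchors and keypoints rather than with the size of a dense BEV grid, which is the efficiency argument the abstract makes.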