Accurate depth information is crucial for enhancing the performance of multi-view 3D object detection. Despite the success of some existing multi-view 3D detectors that utilize pixel-wise depth supervision, they overlook two significant phenomena: 1) the depth supervision obtained from LiDAR points is usually distributed on the surface of the object, which is unfriendly to existing DETR-based 3D detectors because it lacks the depth of the 3D object center; 2) for distant objects, fine-grained depth estimation of the whole object is more challenging. Therefore, we argue that object-wise depth (i.e., the depth of the object's 3D center) is essential for accurate detection. In this paper, we propose a new multi-view 3D object detector named OPEN, whose main idea is to effectively inject object-wise depth information into the network through our proposed object-wise position embedding. Specifically, we first employ an object-wise depth encoder, which takes the pixel-wise depth map as a prior, to accurately estimate the object-wise depth. Then, we utilize the proposed object-wise position embedding to encode the object-wise depth information into the transformer decoder, thereby producing 3D object-aware features for final detection. Extensive experiments verify the effectiveness of our proposed method. Furthermore, OPEN achieves a new state-of-the-art performance with 64.4% NDS and 56.7% mAP on the nuScenes test benchmark.
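The pipeline described above can be sketched at a high level: an object-wise depth encoder refines a pixel-wise depth prior into one depth estimate per query, and a position-embedding MLP injects that depth into the decoder's query features. This is a minimal illustrative sketch, not the paper's actual implementation; all module names, layer choices, and tensor shapes are assumptions for exposition.

```python
import torch
import torch.nn as nn


class ObjectWisePositionEmbedding(nn.Module):
    """Hypothetical sketch of OPEN's core idea: estimate an object-wise
    (center) depth from a pixel-wise depth prior, then encode it as a
    position embedding added to transformer decoder query features.
    Layer configurations here are illustrative assumptions."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Object-wise depth encoder: refines the pixel-wise depth prior
        # (sampled at each query's location) into a per-object depth.
        self.depth_encoder = nn.Sequential(
            nn.Linear(1, embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, 1),
        )
        # Maps the scalar object-wise depth to a C-dim position embedding.
        self.pos_embed = nn.Sequential(
            nn.Linear(1, embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, query_feats: torch.Tensor, depth_prior: torch.Tensor):
        # query_feats: (B, N, C) decoder query features
        # depth_prior: (B, N, 1) pixel-wise depth prior per query
        obj_depth = self.depth_encoder(depth_prior)  # (B, N, 1) object-wise depth
        pe = self.pos_embed(obj_depth)               # (B, N, C) depth embedding
        # Inject object-wise depth into the queries, yielding
        # 3D object-aware features for the decoder.
        return query_feats + pe, obj_depth


# Toy usage with assumed shapes (900 queries, 256-dim features).
B, N, C = 2, 900, 256
feats = torch.randn(B, N, C)
prior = torch.rand(B, N, 1) * 60.0  # depth prior in meters
module = ObjectWisePositionEmbedding(C)
out, depth = module(feats, prior)
```

Note the design choice this sketch highlights: rather than supervising depth only at surface pixels, the per-object depth estimate lets the embedding carry center-depth information directly into the decoder.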