Recent multi-camera 3D object detectors increasingly rely on a unified bird's-eye-view (BEV) representation. However, directly transforming features extracted in the image plane to BEV inevitably distorts them, especially around the objects of interest, causing the objects to blur into the background. To address this, we propose OA-BEV, a network that can be plugged into BEV-based 3D object detection frameworks to bring out the objects by incorporating object-aware pseudo-3D features and depth features. These features encode each object's position and 3D structure. First, we explicitly guide the network to learn the depth distribution through object-level supervision from each 3D object's center. Then, we select the foreground pixels with a 2D object detector and project them into 3D space for pseudo-voxel feature encoding. Finally, the object-aware depth features and pseudo-voxel features are incorporated into the BEV representation with a deformable attention mechanism. We conduct extensive experiments on the nuScenes dataset to validate the merits of the proposed OA-BEV. Our method achieves consistent improvements over BEV-based baselines in terms of both average precision and nuScenes detection score. Our code will be released.
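The foreground selection and 3D projection step can be illustrated with a minimal sketch. It is not the OA-BEV implementation: it assumes a standard pinhole camera model, a dense per-pixel depth map, and axis-aligned 2D boxes given in integer pixel coordinates. The function name `backproject_foreground` and all of its parameters are hypothetical, and the pseudo-voxel feature encoding that would follow is omitted.

```python
def backproject_foreground(depth, boxes, fx, fy, cx, cy):
    """Lift pixels inside 2D boxes to 3D points in the camera frame.

    depth: list of rows, depth[v][u] is the depth of pixel (u, v) in metres
           (assumed input; the abstract does not specify the depth format).
    boxes: iterable of (x1, y1, x2, y2) integer pixel boxes, with the
           maximum side excluded.
    fx, fy, cx, cy: pinhole intrinsics (focal lengths and principal point).
    Returns a list of (X, Y, Z) tuples, ordered by row and then column.
    """
    height, width = len(depth), len(depth[0])

    # Collect foreground pixels; a set removes duplicates from overlapping boxes.
    foreground = set()
    for x1, y1, x2, y2 in boxes:
        for v in range(max(0, y1), min(height, y2)):
            for u in range(max(0, x1), min(width, x2)):
                foreground.add((v, u))

    # Invert the pinhole projection: u = fx * X / Z + cx, v = fy * Y / Z + cy.
    points = []
    for v, u in sorted(foreground):
        z = depth[v][u]
        points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Usage: a 4x6 image at constant 10 m depth with one 2x2 detection box.
depth_map = [[10.0] * 6 for _ in range(4)]
pts = backproject_foreground(depth_map, [(1, 1, 3, 3)], 2.0, 2.0, 3.0, 2.0)
```

Restricting the lift to detected boxes keeps the resulting point set small and concentrated on objects, which is the motivation the abstract gives for selecting foreground pixels before the 3D encoding.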