Monocular 3D object detection has long been a challenging task in autonomous driving. Most existing methods follow conventional 2D detectors: they first localize object centers and then predict 3D attributes from neighboring features. However, local visual features alone are insufficient for understanding scene-level 3D spatial structure, and they ignore long-range inter-object depth relations. In this paper, we introduce the first DETR framework for Monocular DEtection with a depth-guided TRansformer, named MonoDETR. We modify the vanilla transformer to be depth-aware and guide the whole detection process with contextual depth cues. Specifically, in parallel with the visual encoder that captures object appearances, we predict a foreground depth map and use a specialized depth encoder to extract non-local depth embeddings. Then, we formulate 3D object candidates as learnable queries and propose a depth-guided decoder to conduct object-scene depth interactions. In this way, each object query estimates its 3D attributes adaptively from the depth-guided regions of the image and is no longer constrained to local visual features. On the KITTI benchmark with monocular images as input, MonoDETR achieves state-of-the-art performance and requires no extra dense depth annotations. In addition, our depth-guided modules can serve as plug-and-play components to enhance multi-view 3D object detectors on the nuScenes dataset, demonstrating strong generalization capacity. Code is available at https://github.com/ZrrSkywalker/MonoDETR.
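To make the object-scene depth interaction described in the abstract concrete, the following is a minimal sketch of one decoder layer in which each object query first attends to depth embeddings and then to visual embeddings. It is an illustration under stated assumptions, not the released MonoDETR implementation: the function names (`cross_attention`, `depth_guided_decoder_layer`) are hypothetical, the attention is single-head with no learned projections, and the depth-then-visual ordering is assumed for the example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, memory):
    """Scaled dot-product attention with a residual connection.

    Simplification (assumption): single head, no learned projections.
    """
    dim = len(queries[0])
    updated = []
    for q in queries:
        # Similarity between this query and every memory token.
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        # Weighted sum of memory tokens.
        attended = [sum(w * m[i] for w, m in zip(weights, memory)) for i in range(dim)]
        # Residual: add the attended context back onto the query.
        updated.append([qi + ai for qi, ai in zip(q, attended)])
    return updated


def depth_guided_decoder_layer(queries, depth_embeddings, visual_embeddings):
    """Hypothetical layer: queries gather depth cues, then appearance cues."""
    queries = cross_attention(queries, depth_embeddings)   # object-scene depth interaction
    return cross_attention(queries, visual_embeddings)     # appearance features


# Toy usage with random tokens standing in for encoder outputs.
random.seed(0)
dim, num_queries, num_tokens = 8, 3, 16


def rand_tokens(n):
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


queries = rand_tokens(num_queries)            # stand-in for learnable object queries
depth_embeddings = rand_tokens(num_tokens)    # stand-in for depth encoder output
visual_embeddings = rand_tokens(num_tokens)   # stand-in for visual encoder output

updated = depth_guided_decoder_layer(queries, depth_embeddings, visual_embeddings)
print(len(updated), len(updated[0]))
```

Because every query attends over all depth tokens, its update can draw on depth cues from anywhere in the image rather than only from features near the object center, which is the non-local behavior the abstract emphasizes.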