3D object detection from multi-view images has drawn much attention over the past few years. Existing methods mainly build 3D representations from multi-view images and apply a dense detection head, or use object queries distributed in 3D space to localize objects. In this paper, we propose the Multi-View 2D Objects guided 3D Object Detector (MV2D), which can lift any 2D object detector to multi-view 3D object detection. Since 2D detections provide valuable priors for object existence, MV2D exploits 2D detectors to generate object queries conditioned on rich image semantics. These dynamically generated queries help MV2D recall objects in the field of view and show a strong capability for localizing 3D objects. For the generated queries, we design a sparse cross-attention module that forces them to focus on the features of specific objects, which suppresses interference from noise. Evaluation results on the nuScenes dataset demonstrate that the dynamic object queries and sparse feature aggregation improve 3D detection capability. MV2D also achieves state-of-the-art performance among existing methods. We hope MV2D can serve as a new baseline for future research. Code is available at \url{https://github.com/tusen-ai/MV2D}.
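The idea of restricting each query to the features of one object can be illustrated with a minimal NumPy sketch. This is not the MV2D implementation (see the linked repository for that). It assumes a single image, single-head dot-product attention without learned projections, and an axis-aligned 2D box as the attended region. The names `box_mask` and `sparse_cross_attention` are hypothetical and introduced only for illustration.

```python
import numpy as np


def box_mask(boxes, height, width):
    """Return a (num_boxes, height * width) boolean mask.

    boxes holds (x1, y1, x2, y2) in feature-map pixels, with x2 and y2
    exclusive. An entry is True where a feature location lies inside the box.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    xs = xs.reshape(-1)
    ys = ys.reshape(-1)
    x1, y1, x2, y2 = (boxes[:, i:i + 1] for i in range(4))
    return (xs >= x1) & (xs < x2) & (ys >= y1) & (ys < y2)


def sparse_cross_attention(queries, features, mask):
    """Attend each query only to the feature locations its mask allows.

    queries:  (Q, C) one query per 2D detection (assumed given)
    features: (N, C) flattened image features
    mask:     (Q, N) boolean, True where attention is permitted
    Returns the aggregated features (Q, C) and attention weights (Q, N).
    """
    if not mask.any(axis=1).all():
        raise ValueError("every query needs at least one allowed location")
    scale = 1.0 / np.sqrt(queries.shape[1])
    logits = (queries @ features.T) * scale
    # Locations outside the box get -inf, so their softmax weight is exactly 0.
    logits = np.where(mask, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ features, weights


# Toy example: a 4x6 feature map with 8 channels and two 2D boxes.
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(4 * 6, 8))
boxes_2d = np.array([[0, 0, 2, 2], [3, 1, 6, 4]])
allowed = box_mask(boxes_2d, height=4, width=6)
# Stand-in for a learned query generator: random vectors, one per box.
object_queries = rng.normal(size=(len(boxes_2d), 8))
aggregated, attn = sparse_cross_attention(object_queries, feature_map, allowed)
```

In this sketch the mask plays the role of the sparsity constraint: features outside a query's box contribute nothing to its update, which is one simple way to keep background regions from influencing the query.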