3D object detection from multi-view images has drawn much attention over the past few years. Existing methods mainly either build 3D representations from multi-view images and apply a dense detection head, or employ object queries distributed in 3D space to localize objects. In this paper, we design the Multi-View 2D Objects guided 3D Object Detector (MV2D), which can lift any 2D object detector to multi-view 3D object detection. Since 2D detections provide valuable priors for object existence, MV2D exploits 2D detectors to generate object queries conditioned on the rich image semantics. These dynamically generated queries help MV2D recall objects in the field of view and show a strong capability for localizing 3D objects. For the generated queries, we design a sparse cross-attention module that forces them to focus on the features of specific objects, which suppresses interference from noise. Evaluation results on the nuScenes dataset demonstrate that dynamic object queries and sparse feature aggregation improve 3D detection capability. MV2D also achieves state-of-the-art performance among existing methods. We hope MV2D can serve as a new baseline for future research.
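The idea of restricting each query to the features of one object can be illustrated with a small sketch. This is not the MV2D module: the abstract does not specify how the sparse cross attention is built, so the code below assumes the simplest reading, in which a query generated from a 2D detection attends only to feature-map cells inside that detection's box. The helper names `box_mask` and `sparse_cross_attention`, the box format (feature-cell coordinates with exclusive upper bounds), and the single-head attention without learned projections are all assumptions made for illustration.

```python
import math


def box_mask(box, height, width):
    """Flat row-major mask: True where a feature cell lies inside the 2D box.

    box = (x0, y0, x1, y1) in feature-cell units; x1 and y1 are exclusive.
    (Hypothetical box format, chosen for this sketch.)
    """
    x0, y0, x1, y1 = box
    return [x0 <= x < x1 and y0 <= y < y1
            for y in range(height) for x in range(width)]


def sparse_cross_attention(query, keys, values, mask):
    """Scaled dot-product attention for one query over masked-in cells only."""
    kept = [i for i, keep in enumerate(mask) if keep]
    if not kept:
        raise ValueError("box covers no feature cells")
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in kept]
    # Numerically stable softmax over the kept cells.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, kept))
            for d in range(dim)]


# Toy 2x2 feature map with 2-dimensional features per cell (row-major order).
keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [-3.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0], [-50.0, 7.0]]
query = [0.5, 0.5]

# A box covering only the top row: the bottom-row cells cannot contribute,
# however large their keys or values are.
top_row = box_mask((0, 0, 2, 1), height=2, width=2)
attended = sparse_cross_attention(query, keys, values, top_row)
```

In this toy setup the two top-row cells score equally against the query, so `attended` is the average of their values, while the large-magnitude features outside the box are ignored entirely. That is the sense in which a box-restricted attention suppresses interference from unrelated regions.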