Autonomous driving research has shown considerable interest in approaches that directly infer 3D objects in Bird's Eye View (BEV) space from multiple cameras. Some works have also used 2D detectors on single images to improve 3D detection. However, these approaches rely on a two-stage process with separate detectors, in which the 2D detection results are used only once, for token selection or query initialization. In this paper, we present SimPB, a single model that simultaneously detects 2D objects in the perspective view and 3D objects in BEV space from multiple cameras. To achieve this, we introduce a hybrid decoder consisting of several multi-view 2D decoder layers and several 3D decoder layers, each designed for its respective detection task. We propose a Dynamic Query Allocation module and an Adaptive Query Aggregation module to continuously update and refine the interaction between 2D and 3D results in a cyclic 3D-2D-3D manner. Additionally, Query-group Attention strengthens the interaction among 2D queries within each camera group. We evaluate our method on the nuScenes dataset and demonstrate promising results for both 2D and 3D detection tasks. Our code is available at: https://github.com/nullmax-vision/SimPB.
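To make the idea of restricting interaction to 2D queries within each camera group more concrete, the following is a minimal sketch, not the SimPB implementation. It assumes that Query-group Attention can be approximated by ordinary scaled dot-product self-attention with a block mask, so that a query only attends to queries assigned to the same camera. The function names `group_attention_mask` and `query_group_attention` are hypothetical, and the learned projections, multiple heads, and other details of a real decoder layer are omitted.

```python
import math


def group_attention_mask(camera_ids):
    """Return mask[i][j] = True when queries i and j belong to the same camera."""
    return [[a == b for b in camera_ids] for a in camera_ids]


def query_group_attention(queries, camera_ids):
    """Single-head self-attention restricted to same-camera queries.

    queries: list of N embedding vectors (lists of floats, all of length D).
    camera_ids: list of N camera indices, one per query (assumed input).
    Returns (outputs, weights), where weights[i][j] is the attention that
    query i pays to query j. Illustrative simplification: no learned
    query/key/value projections are applied.
    """
    dim = len(queries[0])
    mask = group_attention_mask(camera_ids)
    outputs, all_weights = [], []
    for i, q_i in enumerate(queries):
        # Scaled dot-product scores; None marks keys from other cameras.
        scores = [
            sum(a * b for a, b in zip(q_i, q_j)) / math.sqrt(dim)
            if mask[i][j] else None
            for j, q_j in enumerate(queries)
        ]
        # Numerically stable softmax over the allowed keys only.
        peak = max(s for s in scores if s is not None)
        exps = [math.exp(s - peak) if s is not None else 0.0 for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the (unprojected) query vectors.
        outputs.append(
            [sum(w * q_j[d] for w, q_j in zip(weights, queries)) for d in range(dim)]
        )
        all_weights.append(weights)
    return outputs, all_weights


if __name__ == "__main__":
    # Five toy 2D queries: the first two from camera 0, the rest from camera 1.
    toy_queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 0.0]]
    toy_cameras = [0, 0, 1, 1, 1]
    _, attn = query_group_attention(toy_queries, toy_cameras)
    for row in attn:
        print([round(w, 3) for w in row])
```

In this sketch the attention weights between queries of different cameras are exactly zero, so each camera group is updated independently of the others. Any exchange of information across views would then have to come from other components, such as the 3D decoder layers described above.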