Multi-view 3D object detection is a crucial component of autonomous driving systems. Contemporary query-based methods primarily either depend on dataset-specific initialization of 3D anchors, which introduces bias, or rely on dense attention mechanisms, which are computationally inefficient and do not scale. To overcome these issues, we present MDHA, a novel sparse query-based framework that constructs adaptive 3D output proposals using hybrid anchors from multi-view, multi-scale image input. Fixed 2D anchors are combined with depth predictions to form 2.5D anchors, which are projected to obtain 3D proposals. To ensure high efficiency, our proposed Anchor Encoder performs sparse refinement and selects the top-$k$ anchors and features. Moreover, while existing multi-view attention mechanisms rely on projecting reference points to multiple images, our novel Circular Deformable Attention mechanism projects to only a single image while allowing reference points to seamlessly attend to adjacent images, improving efficiency without compromising performance. On the nuScenes val set, MDHA achieves 46.4\% mAP and 55.0\% NDS with a ResNet101 backbone, significantly outperforming the baseline in which anchor proposals are modelled as learnable embeddings. Code is available at https://github.com/NaomiEX/MDHA.
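The lifting of a 2.5D anchor (a fixed 2D pixel location plus a predicted depth) into a 3D proposal described above follows standard pinhole camera geometry. Below is a minimal NumPy sketch of that unprojection step; the function name, argument layout, and the `cam_to_ego` extrinsic are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def unproject_25d_anchor(uv, depth, K, cam_to_ego):
    """Lift a 2.5D anchor (pixel coords + predicted depth) to a 3D point.

    uv         : (2,) pixel coordinates of the fixed 2D anchor
    depth      : scalar predicted depth along the camera ray
    K          : (3, 3) camera intrinsic matrix
    cam_to_ego : (4, 4) camera-to-ego-frame transform (assumed convention)
    """
    u, v = uv
    # Back-project through the inverse intrinsics: x_cam = d * K^{-1} [u, v, 1]^T
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    p_cam = depth * ray
    # Move the camera-frame point into the ego frame via the extrinsics
    p_hom = cam_to_ego @ np.append(p_cam, 1.0)
    return p_hom[:3]
```

For example, an anchor at the principal point with depth $d$ maps to the point $(0, 0, d)$ on the camera's optical axis; in MDHA such 3D proposals seed the sparse decoder queries.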