Sparse query-based paradigms have achieved significant success in multi-view 3D detection for autonomous vehicles. Current research faces the challenge of balancing enlarged receptive fields against reduced interference when aggregating multi-view features. Moreover, the differing poses of the cameras complicate the training of global attention models. To address these problems, this paper proposes a divided-view method in which features are modeled globally via a visibility cross-attention mechanism but interact only with partial features within a divided local virtual space. This effectively reduces interference from irrelevant features and alleviates the training difficulty of the transformer by decoupling the position embedding from camera poses. Additionally, 2D historical RoI features are incorporated into object-centric temporal modeling to exploit high-level visual semantic information. The model is trained with a one-to-many assignment strategy to improve stability. Our framework, named DVPE, achieves state-of-the-art performance (57.2% mAP and 64.5% NDS) on the nuScenes test set. Code will be available at https://github.com/dop0/DVPE.
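The core idea of restricting each query to features in its own divided local space can be illustrated with a minimal sketch. This is a hypothetical toy partitioning, not the paper's actual implementation: it divides the bird's-eye-view plane into angular sectors around the ego vehicle and builds an attention mask so that a query only interacts with features in the same sector; the function names and the sector-based scheme are assumptions for illustration.

```python
import math

def divide_into_sectors(points, num_sectors):
    """Assign each BEV point (x, y) to one of `num_sectors` angular
    sectors around the ego vehicle. Hypothetical helper: the paper's
    actual partitioning of the virtual space may differ."""
    sectors = [[] for _ in range(num_sectors)]
    for idx, (x, y) in enumerate(points):
        angle = math.atan2(y, x) % (2 * math.pi)
        s = int(angle / (2 * math.pi / num_sectors))
        sectors[min(s, num_sectors - 1)].append(idx)  # guard float edge
    return sectors

def local_attention_mask(query_pts, feat_pts, num_sectors):
    """Boolean mask for cross-attention: query i may attend to feature j
    only if both fall in the same divided local space, which is what
    suppresses interference from features in other regions."""
    q_sec = divide_into_sectors(query_pts, num_sectors)
    f_sec = divide_into_sectors(feat_pts, num_sectors)
    q_id = {i: s for s, idxs in enumerate(q_sec) for i in idxs}
    f_id = {j: s for s, idxs in enumerate(f_sec) for j in idxs}
    return [[q_id[i] == f_id[j] for j in range(len(feat_pts))]
            for i in range(len(query_pts))]

# Example: two queries, two image features, four 90-degree sectors.
mask = local_attention_mask([(1.0, 0.0), (0.0, 1.0)],
                            [(2.0, 0.1), (-1.0, 0.0)], 4)
# The query at (1, 0) shares a sector only with the feature at (2, 0.1).
```

In a real model such a mask would be passed to the cross-attention layer (e.g. as an additive `-inf` mask on disallowed pairs), and because positions inside each local space can be expressed relative to that space, the position embedding no longer depends on the individual camera poses.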