3D visual perception tasks based on multi-camera images are essential for autonomous driving systems. Recent work in this field performs 3D object detection by taking multi-view images as input and iteratively refining object queries (object proposals) through cross-attention to multi-view features. However, the individual backbone features are not themselves updated with multi-view information, so they remain a mere collection of single-image backbone outputs. We therefore propose 3M3D: A Multi-view, Multi-path, Multi-representation for 3D Object Detection, in which we update both the multi-view features and the query features to enhance the representation of the scene in both a fine panoramic view and a coarse global view. First, we update the multi-view features with self-attention along the multi-view axis, which incorporates panoramic information into the features and improves understanding of the global scene. Second, we update the multi-view features with self-attention over ROI (Region of Interest) windows, which encodes finer local details in the features and lets information be exchanged not only along the multi-view axis but also along the other spatial dimension. Finally, we exploit the multiple representations of queries in different domains to further boost performance: we use sparse floating queries together with dense BEV (Bird's Eye View) queries, and post-process their outputs to filter duplicate detections. We show performance improvements over our baselines on the nuScenes benchmark dataset.
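To make the first step concrete, the following is a minimal sketch of self-attention along the multi-view axis, written with NumPy. It is an illustration under stated assumptions, not the implementation used in 3M3D: the feature layout `(V, H, W, C)`, the single attention head, the residual connection, and the names `multiview_axis_self_attention`, `w_q`, `w_k`, `w_v` are all hypothetical choices made here. Each spatial location attends only to the same location in the other camera views.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multiview_axis_self_attention(feats, w_q, w_k, w_v):
    """Single-head self-attention across the camera-view axis.

    feats: array of shape (V, H, W, C), one feature map per camera view
           (this layout is an assumption of the sketch).
    w_q, w_k: projection matrices of shape (C, D).
    w_v: projection matrix of shape (C, C), so the residual shapes match.
    Returns an array of shape (V, H, W, C).
    """
    V, H, W, C = feats.shape
    # Treat every spatial location as an independent sequence of V tokens.
    tokens = feats.transpose(1, 2, 0, 3).reshape(H * W, V, C)

    q = tokens @ w_q                      # (H*W, V, D)
    k = tokens @ w_k                      # (H*W, V, D)
    v = tokens @ w_v                      # (H*W, V, C)

    # Scaled dot-product attention over the V views at each location.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (H*W, V, V)
    attn = softmax(scores, axis=-1)
    out = tokens + attn @ v               # residual connection (assumed)

    # Restore the original (V, H, W, C) layout.
    return out.reshape(H, W, V, C).transpose(2, 0, 1, 3)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V, H, W, C = 6, 4, 5, 8               # 6 views, as in the nuScenes camera rig
    feats = rng.standard_normal((V, H, W, C))
    w_q, w_k, w_v = (0.1 * rng.standard_normal((C, C)) for _ in range(3))
    updated = multiview_axis_self_attention(feats, w_q, w_k, w_v)
    print(updated.shape)
```

Because attention here runs only over the view axis, information at a given pixel location never reaches other locations in this step; that mixing is the role the abstract assigns to the ROI window self-attention.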