Recently, pure camera-based Bird's-Eye-View (BEV) perception has provided a feasible solution for economical autonomous driving. However, existing BEV-based multi-view 3D detectors generally transform all image features into BEV features, ignoring the fact that the large proportion of background information may overwhelm the object information. In this paper, we propose Semantic-Aware BEV Pooling (SA-BEVPool), which filters out background information according to the semantic segmentation of image features and transforms the remaining image features into semantic-aware BEV features. Accordingly, we propose BEV-Paste, an effective data augmentation strategy that closely matches the semantic-aware BEV features. In addition, we design a Multi-Scale Cross-Task (MSCT) head, which combines task-specific and cross-task information to predict the depth distribution and semantic segmentation more accurately, further improving the quality of the semantic-aware BEV features. Finally, we integrate the above modules into a novel multi-view 3D object detection framework, namely SA-BEV. Experiments on nuScenes show that SA-BEV achieves state-of-the-art performance. Code is available at https://github.com/mengtan00/SA-BEV.git.
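To make the idea behind SA-BEVPool concrete, the following is a minimal NumPy sketch of pooling image features into a BEV grid while discarding pixels that the semantic segmentation marks as background. The abstract does not specify the filtering rule, so everything here is an assumption for illustration: the function name `semantic_aware_bev_pool`, the fixed `threshold` on the foreground probability, the weighting of each feature by depth probability times foreground probability, and the precomputed `bev_index` lookup that stands in for the camera-to-BEV projection. It is not the authors' implementation.

```python
import numpy as np


def semantic_aware_bev_pool(feats, depth_prob, sem_prob, bev_index,
                            num_cells, threshold=0.25):
    """Illustrative sketch: pool foreground image features into BEV cells.

    feats:      (P, C) image features for P pixels
    depth_prob: (P, D) predicted depth distribution per pixel
    sem_prob:   (P,)   predicted foreground probability per pixel
    bev_index:  (P, D) flat BEV cell id of each (pixel, depth bin),
                or -1 if the point falls outside the grid (hypothetical
                stand-in for the camera-to-BEV projection)
    num_cells:  number of cells in the flattened BEV grid
    threshold:  assumed cut-off below which a pixel counts as background
    """
    num_channels = feats.shape[1]

    # Assumed weighting: depth probability scaled by foreground probability.
    weight = depth_prob * sem_prob[:, None]                      # (P, D)

    # Keep only foreground pixels whose lifted points land inside the grid.
    keep = (sem_prob[:, None] >= threshold) & (bev_index >= 0)   # (P, D)
    pix, dbin = np.nonzero(keep)

    cells = bev_index[pix, dbin]                                 # (K,)
    contrib = feats[pix] * weight[pix, dbin][:, None]            # (K, C)

    # Sum-pool all surviving contributions that share a BEV cell.
    bev = np.zeros((num_cells, num_channels), dtype=np.float64)
    np.add.at(bev, cells, contrib)
    return bev
```

With `threshold=0.0` every in-grid pixel contributes, which corresponds to the conventional behaviour of transforming all image features into BEV features; raising the threshold removes low-confidence (background) pixels before pooling.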