Multi-camera 3D perception has emerged as a prominent research field in autonomous driving, offering a viable and cost-effective alternative to LiDAR-based solutions. However, existing multi-camera algorithms rely primarily on monocular image pre-training, which overlooks the spatial and temporal correlations among different camera views. To address this limitation, we propose Occ-BEV, a novel multi-camera unified pre-training framework that first reconstructs the 3D scene as a foundational pre-training stage and then fine-tunes the model on downstream tasks. Specifically, a 3D decoder takes Bird's Eye View (BEV) features extracted from multi-view images and predicts 3D geometric occupancy, which enables the model to capture a more comprehensive understanding of the 3D environment. A significant advantage of Occ-BEV is that it can use a vast amount of unlabeled image-LiDAR pairs for pre-training. The proposed framework shows promising results on key tasks such as multi-camera 3D object detection and semantic scene completion. Compared with monocular pre-training methods on the nuScenes dataset, Occ-BEV improves mAP by 2.0% and NDS by 2.0% for 3D object detection, and improves mIoU by 0.8% for semantic scene completion. Code is publicly available at https://github.com/chaytonmin/Occ-BEV.
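To make the occupancy pre-training objective more concrete, the following is a minimal sketch of how a binary occupancy target could be derived from an unlabeled LiDAR sweep and compared against a decoder's predicted occupancy probabilities. It is an illustration under stated assumptions, not the authors' implementation: the function names `voxelize_occupancy` and `occupancy_bce`, the grid range, the voxel size, and the choice of binary cross-entropy as the loss are all hypothetical.

```python
import numpy as np


def voxelize_occupancy(points, pc_range, voxel_size):
    """Build a binary occupancy grid from LiDAR points (hypothetical helper).

    points:     (N, 3+) array of x, y, z coordinates (extra columns ignored).
    pc_range:   [x_min, y_min, z_min, x_max, y_max, z_max] (assumed values).
    voxel_size: [dx, dy, dz] edge lengths of one voxel (assumed values).
    A voxel is marked 1.0 if at least one point falls inside it, else 0.0.
    """
    pc_range = np.asarray(pc_range, dtype=np.float64)
    voxel_size = np.asarray(voxel_size, dtype=np.float64)
    lo, hi = pc_range[:3], pc_range[3:]
    grid_shape = np.round((hi - lo) / voxel_size).astype(int)

    xyz = np.asarray(points, dtype=np.float64)[:, :3]
    # Discard points outside the region of interest.
    inside = np.all((xyz >= lo) & (xyz < hi), axis=1)
    idx = np.floor((xyz[inside] - lo) / voxel_size).astype(int)
    # Guard against floating-point rounding at the upper boundary.
    idx = np.minimum(idx, grid_shape - 1)

    occ = np.zeros(grid_shape, dtype=np.float32)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return occ


def occupancy_bce(pred_prob, target, eps=1e-7):
    """Mean binary cross-entropy between predicted and target occupancy.

    The loss choice is an assumption made for this sketch.
    """
    p = np.clip(np.asarray(pred_prob, dtype=np.float64), eps, 1.0 - eps)
    t = np.asarray(target, dtype=np.float64)
    return float(-np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))
```

In this sketch the LiDAR sweep alone supplies the supervision signal, which is why no human annotation is needed: the target grid comes from `voxelize_occupancy`, and the output of a 3D decoder (here, any array of probabilities with the same shape) is scored with `occupancy_bce`.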