Multi-camera 3D perception has emerged as a prominent research field in autonomous driving, offering a viable and cost-effective alternative to LiDAR-based solutions. However, existing multi-camera algorithms rely primarily on monocular image pre-training, which overlooks the spatial and temporal correlations among camera views. To address this limitation, we propose Occ-BEV, the first multi-camera unified pre-training framework, which first reconstructs the 3D scene as a foundational stage and then fine-tunes the model on downstream tasks. Specifically, a 3D decoder takes Bird's Eye View (BEV) features from multi-view images and predicts 3D geometric occupancy, enabling the model to capture a more comprehensive understanding of the 3D environment. A significant benefit of Occ-BEV is that it can exploit large volumes of unlabeled image-LiDAR pairs for pre-training. The proposed framework demonstrates promising results on key tasks such as multi-camera 3D object detection and surrounding semantic scene completion. Compared with monocular pre-training methods on the nuScenes dataset, Occ-BEV improves multi-camera 3D object detection by about 2.0% in mAP and 2.0% in NDS, and improves surrounding semantic scene completion by 3% in mIoU. Code is publicly available at https://github.com/chaytonmin/Occ-BEV.
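As a rough illustration of how unlabeled image-LiDAR pairs could supervise occupancy prediction, the sketch below turns a LiDAR point cloud into a binary occupancy grid and scores predicted occupancy logits against it. It is not taken from the Occ-BEV code: the function names `voxelize_occupancy` and `occupancy_bce`, the `pc_range` and `voxel_size` parameters, and the choice of a plain binary cross-entropy loss are all assumptions made for this example, since the abstract does not state how labels are built or which loss is used.

```python
import numpy as np


def voxelize_occupancy(points, pc_range, voxel_size):
    """Mark every voxel that contains at least one LiDAR point as occupied.

    points:     (N, 3) array of x, y, z coordinates.
    pc_range:   [x_min, y_min, z_min, x_max, y_max, z_max] (assumed convention).
    voxel_size: [dx, dy, dz] edge lengths of one voxel.
    """
    pc_range = np.asarray(pc_range, dtype=np.float64)
    voxel_size = np.asarray(voxel_size, dtype=np.float64)
    lo, hi = pc_range[:3], pc_range[3:]
    grid_shape = np.round((hi - lo) / voxel_size).astype(int)

    # Discard points that fall outside the perception range.
    inside = np.all((points >= lo) & (points < hi), axis=1)
    idx = np.floor((points[inside] - lo) / voxel_size).astype(int)
    # Guard against floating-point rounding right at the upper boundary.
    idx = np.minimum(idx, grid_shape - 1)

    occ = np.zeros(grid_shape, dtype=np.float32)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return occ


def occupancy_bce(logits, target):
    """Mean binary cross-entropy on raw logits, in a numerically stable form."""
    loss = (np.maximum(logits, 0.0)
            - logits * target
            + np.log1p(np.exp(-np.abs(logits))))
    return float(loss.mean())


# Toy usage: four points, one of them outside the range.
demo_points = np.array([
    [0.5, 0.5, 0.5],
    [0.6, 0.4, 0.5],
    [-1.5, 1.5, 0.2],
    [10.0, 0.0, 0.0],
])
demo_occ = voxelize_occupancy(demo_points, [-2, -2, 0, 2, 2, 1], [1, 1, 1])
# Stand-in for the output of a 3D decoder: all-zero logits.
demo_loss = occupancy_bce(np.zeros_like(demo_occ), demo_occ)
```

In an actual pre-training setup the logits would come from a 3D decoder applied to BEV features rather than from a zero array, and the trained backbone would then be fine-tuned on the downstream task.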