Multi-camera 3D perception has emerged as a prominent research field in autonomous driving, offering a viable and cost-effective alternative to LiDAR-based solutions. Existing multi-camera algorithms rely primarily on monocular 2D pre-training, which overlooks the spatial and temporal correlations within a multi-camera system. To address this limitation, we propose UniScene, the first multi-camera unified pre-training framework, which first reconstructs the 3D scene as a foundational stage and then fine-tunes the model on downstream tasks. Specifically, we adopt occupancy as the general representation of the 3D scene, enabling the model to learn geometric priors of the surrounding world through pre-training. A significant benefit of UniScene is that it can exploit large volumes of unlabeled image-LiDAR pairs for pre-training. The proposed framework demonstrates promising results on key tasks such as multi-camera 3D object detection and surrounding semantic scene completion. Compared with monocular pre-training methods on the nuScenes dataset, UniScene improves multi-camera 3D object detection by about 2.0% in mAP and 2.0% in NDS, and improves surrounding semantic scene completion by 3% in mIoU. Adopting our unified pre-training method can reduce 3D training annotation costs by 25%, offering significant practical value for real-world autonomous driving. Code is publicly available at https://github.com/chaytonmin/UniScene.
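To make the idea of occupancy as a label-free pre-training target more concrete, the following is a minimal sketch of how an unlabeled LiDAR sweep could be turned into a set of occupied voxels. It is an illustration under stated assumptions, not UniScene's actual implementation: the function name `voxelize_occupancy`, the parameters `pc_range` and `voxel_size`, and the default range and resolution are all hypothetical and are not taken from the paper.

```python
from typing import Iterable, Set, Tuple

Point = Tuple[float, float, float]   # (x, y, z) in metres, ego frame
Voxel = Tuple[int, int, int]         # integer voxel index (ix, iy, iz)


def voxelize_occupancy(
    points: Iterable[Point],
    # Hypothetical scene bounds: (x_min, y_min, z_min, x_max, y_max, z_max).
    pc_range: Tuple[float, float, float, float, float, float] = (
        -50.0, -50.0, -5.0, 50.0, 50.0, 3.0),
    # Hypothetical voxel edge length in metres.
    voxel_size: float = 0.5,
) -> Set[Voxel]:
    """Return the indices of voxels that contain at least one LiDAR point.

    Voxels in the returned set are treated as occupied; every other voxel
    inside pc_range is treated as free. No human annotation is involved.
    """
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    occupied: Set[Voxel] = set()
    for x, y, z in points:
        # Discard points that fall outside the scene bounds.
        if not (x_min <= x < x_max and y_min <= y < y_max and z_min <= z < z_max):
            continue
        occupied.add((
            int((x - x_min) // voxel_size),
            int((y - y_min) // voxel_size),
            int((z - z_min) // voxel_size),
        ))
    return occupied


if __name__ == "__main__":
    sweep = [(0.1, 0.2, -4.9), (0.2, 0.3, -4.8), (100.0, 0.0, 0.0)]
    print(sorted(voxelize_occupancy(sweep)))
```

In a pre-training setup of this kind, the occupied set would serve as the binary target that the multi-camera network is asked to reconstruct from images alone; the choice of loss and grid resolution is left open here because the abstract does not specify them.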