UniScene: Multi-Camera Unified Pre-training via 3D Scene Reconstruction

Multi-camera 3D perception has emerged as a prominent research field in autonomous driving, offering a viable and cost-effective alternative to LiDAR-based solutions. Existing multi-camera algorithms rely primarily on monocular 2D pre-training, which overlooks the spatial and temporal correlations within a multi-camera system. To address this limitation, we propose UniScene, the first multi-camera unified pre-training framework, which first reconstructs the 3D scene as a foundational stage and then fine-tunes the model on downstream tasks. Specifically, we employ occupancy as the general representation of the 3D scene, enabling the model to learn geometric priors of the surrounding world through pre-training. A significant benefit of UniScene is that it can use a large volume of unlabeled image-LiDAR pairs for pre-training. The proposed framework shows promising results on key tasks such as multi-camera 3D object detection and surrounding semantic scene completion. Compared with monocular pre-training methods on the nuScenes dataset, UniScene improves mAP by about 2.0% and NDS by about 2.0% for multi-camera 3D object detection, and improves mIoU by 3% for surrounding semantic scene completion. Adopting our unified pre-training method can reduce 3D training annotation costs by 25%, which offers significant practical value for deploying real-world autonomous driving. Code is publicly available at https://github.com/chaytonmin/UniScene.
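
To make the pre-training target concrete, the sketch below shows one plausible way to turn an unlabeled LiDAR point cloud into binary occupancy labels that a multi-camera model could be trained to reconstruct. It is a minimal illustration, not the released UniScene implementation: the function names `voxelize_occupancy` and `grid_shape`, the point-cloud range, and the voxel size are all assumptions chosen for the example.

```python
import math


def grid_shape(pc_range, voxel_size):
    """Number of voxels along x, y, z for a given range (hypothetical helper)."""
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    return (
        round((x_max - x_min) / voxel_size),
        round((y_max - y_min) / voxel_size),
        round((z_max - z_min) / voxel_size),
    )


def voxelize_occupancy(points, pc_range, voxel_size):
    """Return the set of (ix, iy, iz) voxels that contain at least one point.

    points:     iterable of (x, y, z) LiDAR coordinates in metres
    pc_range:   (x_min, y_min, z_min, x_max, y_max, z_max), an assumed range
    voxel_size: edge length of a cubic voxel in metres, an assumed value
    """
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    nx, ny, nz = grid_shape(pc_range, voxel_size)
    occupied = set()
    for x, y, z in points:
        # Discard points outside the region covered by the grid.
        if not (x_min <= x < x_max and y_min <= y < y_max and z_min <= z < z_max):
            continue
        # Clamp to guard against floating-point rounding at the upper edge.
        ix = min(math.floor((x - x_min) / voxel_size), nx - 1)
        iy = min(math.floor((y - y_min) / voxel_size), ny - 1)
        iz = min(math.floor((z - z_min) / voxel_size), nz - 1)
        occupied.add((ix, iy, iz))
    return occupied


if __name__ == "__main__":
    # Toy point cloud: two points share a voxel, one falls outside the range.
    toy_points = [(0.2, 0.2, 0.2), (0.3, 0.1, 0.4), (1.7, 0.2, 0.2), (10.0, 0.0, 0.0)]
    toy_range = (0.0, 0.0, 0.0, 2.0, 2.0, 1.0)
    print(grid_shape(toy_range, 0.5))
    print(sorted(voxelize_occupancy(toy_points, toy_range, 0.5)))
```

Every voxel in the returned set would be labeled occupied and all remaining voxels in the grid empty, so the labels come from the LiDAR geometry alone and need no human annotation.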
