Multi-camera 3D object detection for autonomous driving is a challenging problem that has attracted considerable attention from both academia and industry. A key obstacle for vision-based methods is extracting geometry-aware features from RGB images. Recent approaches use geometry-aware image backbones pretrained on depth-related tasks to acquire spatial information. However, these approaches overlook view transformation, and the resulting misalignment of spatial knowledge between the image backbone and the view transformation leads to suboptimal performance. To address this issue, we propose a novel geometry-aware pretraining framework called GAPretrain. Our approach injects spatial and structural cues into camera networks by using a geometry-rich modality as guidance during the pretraining phase. Transferring modality-specific attributes across modalities is non-trivial; we bridge this gap with a unified bird's-eye-view (BEV) representation and structural hints derived from LiDAR point clouds. GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors. Our experiments demonstrate the effectiveness and generalization ability of the proposed method. With BEVFormer, we achieve 46.2 mAP and 55.5 NDS on the nuScenes val set, gains of 2.7 and 2.1 points, respectively. We also conduct experiments on various image backbones and view transformations to validate the efficacy of our approach. Code will be released at https://github.com/OpenDriveLab/BEVPerception-Survey-Recipe.
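The abstract does not spell out the pretraining objective, so the following is only a minimal sketch of one plausible form of LiDAR-guided BEV supervision: a structural mask marks BEV cells that contain LiDAR points, and the camera BEV map is pulled toward the LiDAR BEV map on those cells only. The function names `occupancy_mask` and `masked_bev_l2`, the single-channel grids, and the choice of a masked L2 loss are all illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Grid = List[List[float]]  # one BEV channel, indexed as grid[row][col]


def occupancy_mask(
    points: Sequence[Tuple[float, float, float]],
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
    cell: float,
) -> Grid:
    """Hypothetical structural hint: 1.0 for BEV cells holding a LiDAR point."""
    cols = int(round((x_range[1] - x_range[0]) / cell))
    rows = int(round((y_range[1] - y_range[0]) / cell))
    mask = [[0.0] * cols for _ in range(rows)]
    for x, y, _z in points:
        # Skip points that fall outside the BEV extent.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        # Clamp to guard against floating-point rounding at the upper edge.
        col = min(int((x - x_range[0]) / cell), cols - 1)
        row = min(int((y - y_range[0]) / cell), rows - 1)
        mask[row][col] = 1.0
    return mask


def masked_bev_l2(camera_bev: Grid, lidar_bev: Grid, mask: Grid) -> float:
    """Assumed loss: mean squared error over occupied BEV cells only."""
    total = 0.0
    count = 0.0
    for cam_row, lidar_row, mask_row in zip(camera_bev, lidar_bev, mask):
        for cam, lidar, m in zip(cam_row, lidar_row, mask_row):
            total += m * (cam - lidar) ** 2
            count += m
    return total / count if count else 0.0


if __name__ == "__main__":
    pts = [(0.5, 0.5, 0.0), (1.5, 0.5, 0.0)]
    hint = occupancy_mask(pts, (0.0, 2.0), (0.0, 2.0), 1.0)
    cam = [[1.0, 2.0], [3.0, 4.0]]
    lidar = [[1.0, 4.0], [0.0, 0.0]]
    print(masked_bev_l2(cam, lidar, hint))
```

Restricting the loss to occupied cells is one way to keep empty background regions from dominating the guidance signal. A real system would operate on multi-channel feature tensors in a deep learning framework rather than on nested lists.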