Geometric-aware Pretraining for Vision-centric 3D Object Detection

Multi-camera 3D object detection for autonomous driving is a challenging problem that has drawn considerable attention from both academia and industry. A key obstacle for vision-based techniques is extracting geometry-aware features from RGB images. Recent approaches use geometry-aware image backbones pretrained on depth-relevant tasks to acquire spatial information. However, these approaches overlook view transformation, and the resulting misalignment of spatial knowledge between the image backbone and the view transformation limits performance. To address this issue, we propose a novel geometric-aware pretraining framework called GAPretrain. Our approach injects spatial and structural cues into camera networks by using a geometry-rich modality as guidance during the pretraining phase. Transferring modality-specific attributes across modalities is non-trivial; we bridge this gap with a unified bird's-eye-view (BEV) representation and structural hints derived from LiDAR point clouds. GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors. Our experiments demonstrate the effectiveness and generalization ability of the proposed method: with BEVFormer, we achieve 46.2 mAP and 55.5 NDS on the nuScenes val set, gains of 2.7 and 2.1 points, respectively. We also conduct experiments on various image backbones and view transformations to validate the efficacy of our approach. Code will be released at https://github.com/OpenDriveLab/BEVPerception-Survey-Recipe.
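
To make the idea of LiDAR guidance in a shared BEV grid more concrete, the following is a minimal sketch of one plausible pretraining objective: a feature-imitation loss that pulls camera BEV features toward LiDAR BEV features, restricted to cells that a LiDAR-derived mask marks as containing structure. This is an illustration under stated assumptions, not the GAPretrain implementation. The function name `masked_bev_imitation_loss`, the `(C, H, W)` feature layout, the binary mask, and the choice of mean squared error are all hypothetical.

```python
import numpy as np


def masked_bev_imitation_loss(cam_bev, lidar_bev, structure_mask, eps=1e-6):
    """Masked MSE between camera and LiDAR BEV feature maps (illustrative only).

    cam_bev, lidar_bev: arrays of shape (C, H, W) on the same BEV grid.
    structure_mask: array of shape (H, W); 1 where LiDAR indicates structure,
        0 elsewhere. How this mask is built is an assumption of the sketch.
    """
    cam_bev = np.asarray(cam_bev, dtype=np.float64)
    lidar_bev = np.asarray(lidar_bev, dtype=np.float64)
    structure_mask = np.asarray(structure_mask, dtype=np.float64)

    if cam_bev.shape != lidar_bev.shape:
        raise ValueError("camera and LiDAR BEV features must share a shape")
    if structure_mask.shape != cam_bev.shape[1:]:
        raise ValueError("mask must match the BEV grid size (H, W)")

    # Squared error averaged over channels gives one value per BEV cell.
    per_cell = ((cam_bev - lidar_bev) ** 2).mean(axis=0)

    # Average only over masked cells; eps avoids division by zero
    # when the mask is empty.
    return float((per_cell * structure_mask).sum() / (structure_mask.sum() + eps))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lidar = rng.normal(size=(8, 4, 4))             # hypothetical LiDAR BEV features
    cam = lidar + 0.1 * rng.normal(size=(8, 4, 4))  # camera features close to LiDAR
    mask = np.zeros((4, 4))
    mask[1:3, 1:3] = 1.0                            # hypothetical structured region
    print(masked_bev_imitation_loss(cam, lidar, mask))
```

In this sketch the LiDAR features act as a fixed target, so in an actual training loop only the camera branch (image backbone and view transformation) would receive gradients. Supervising after the view transformation, rather than only at the image backbone, is what lets both components be shaped by the same geometric signal.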
