We introduce the Visual Implicit Geometry Transformer (ViGT), a geometric model for autonomous driving that estimates continuous 3D occupancy fields from surround-view camera rigs. ViGT represents a step towards foundational geometric models for autonomous driving, prioritizing scalability, architectural simplicity, and generalization across diverse sensor configurations. Our approach achieves this through a calibration-free architecture, enabling a single model to adapt to different sensor setups. Unlike general-purpose geometric foundational models that focus on pixel-aligned predictions, ViGT estimates a continuous 3D occupancy field in a bird's-eye-view (BEV) representation, addressing domain-specific requirements. ViGT naturally fuses geometry from multiple camera views into a single metric coordinate frame, providing a common representation for multiple geometric tasks. Unlike most existing occupancy models, we adopt a self-supervised training procedure that leverages synchronized image-LiDAR pairs, eliminating the need for costly manual annotations. We validate the scalability and generalizability of our approach by training our model on a mixture of five large-scale autonomous driving datasets (NuScenes, Waymo, NuPlan, ONCE, and Argoverse) and achieving state-of-the-art performance on the pointmap estimation task, with the best average rank across all evaluated baselines. We further evaluate ViGT on the Occ3D-nuScenes benchmark, where it achieves performance comparable to supervised methods. The source code is publicly available at \href{https://github.com/whesense/ViGT}{https://github.com/whesense/ViGT}.