LiAuto-GeoX: Efficient Grounded Driving Transformer

Dense 3D reconstruction has shown strong potential for spatial understanding, yet its viability as a real-time, onboard representation for autonomous driving remains an open question. Existing large-scale visual geometry models typically require substantial computational resources and lack the long-range geometric fidelity, surround-view consistency, and real-time efficiency that dynamic driving environments demand. To bridge this gap, we present \textbf{LiAuto-GeoX}, an efficient grounded driving transformer designed for deployable, ego-centric 3D scene understanding. Our approach first learns a high-capacity driving geometry model from large-scale surround-view data, using sparse LiDAR priors to provide robust geometric grounding in distant, ambiguous, or structure-sparse regions. We then distill this capability into a compact 155M-parameter onboard model through a novel geometry-preserving distillation framework. The framework combines two components: mask-guided depth-aware distillation, which retains fine-grained metric structure by emphasizing geometrically informative regions, and relative-pose relational distillation, which enforces cross-view spatial consistency through pose-induced geometric relations. Extensive evaluations show that \textbf{LiAuto-GeoX} runs at 220 FPS on KITTI while maintaining high-fidelity dense reconstruction, enabling real-time deployment. The learned geometry also transfers to downstream autonomy tasks, achieving 90.6 PDMS in trajectory prediction, 24.63 mIoU in occupancy prediction, and 47.67 IoU in future-frame prediction. Together, these results demonstrate that efficient dense 3D reconstruction can move beyond its traditional role as a perception target and serve as a scalable, foundational geometric representation for next-generation autonomous driving.
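
The two distillation terms named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the loss formulas, so everything here is an assumption made for illustration: the function names, the log-depth L1 form, the use of `mask` as per-pixel weights, and the rotation-angle plus translation-norm pose discrepancy. It is not the authors' formulation.

```python
import numpy as np


def masked_depth_distill_loss(student_depth, teacher_depth, mask, eps=1e-6):
    """Mask-weighted L1 distance between student and teacher log-depth maps.

    Assumption: `mask` holds non-negative per-pixel weights that are larger
    on geometrically informative pixels. How such a mask is built is not
    described in the abstract.
    """
    diff = np.abs(np.log(student_depth + eps) - np.log(teacher_depth + eps))
    return float((mask * diff).sum() / (mask.sum() + eps))


def relative_pose(T_i, T_j):
    """Pose of view j expressed in the frame of view i (4x4 rigid transforms)."""
    return np.linalg.inv(T_i) @ T_j


def relative_pose_distill_loss(student_poses, teacher_poses):
    """Mean discrepancy between student and teacher pairwise relative poses.

    Assumption: the discrepancy is the residual rotation angle (radians)
    plus the residual translation norm, averaged over all view pairs.
    """
    n = len(student_poses)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            rel_s = relative_pose(student_poses[i], student_poses[j])
            rel_t = relative_pose(teacher_poses[i], teacher_poses[j])
            # The residual transform is the identity when both relations agree.
            delta = np.linalg.inv(rel_t) @ rel_s
            cos_angle = np.clip((np.trace(delta[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
            total += np.arccos(cos_angle) + np.linalg.norm(delta[:3, 3])
            pairs += 1
    return float(total / max(pairs, 1))
```

Because the second term compares only relations between views, it is unchanged when all student poses are moved by one shared rigid transform. That property is one plausible reading of "pose-induced geometric relations": the student is asked to match cross-view structure rather than absolute poses.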
