End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which learn language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. Because vehicles operate in a 3D world, we argue that dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi-frame inputs and cannot be applied to online planning. To address this, we introduce a streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs online and jointly outputs dense geometry and trajectory planning for the current frame. We employ temporal causal attention and cache historical features to support on-the-fly inference. To further improve efficiency, we propose a sliding-window streaming strategy that reuses historical caches within a certain interval to avoid repeated computation. Despite its faster inference, DVGT-2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT-2 can be applied directly to planning across diverse camera configurations without fine-tuning, including on the closed-loop NAVSIM and open-loop nuScenes benchmarks.
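The streaming idea described above (temporal causal attention over cached historical features, restricted to a sliding window) can be illustrated with a minimal sketch. This is not the DVGT-2 implementation: the class name `StreamingAttention`, the `window` parameter, and the single-head, projection-free attention are all hypothetical simplifications chosen for illustration, and each frame is reduced to one feature vector.

```python
import math
from collections import deque


class StreamingAttention:
    """Hypothetical single-head attention over a sliding window of cached frames."""

    def __init__(self, window: int):
        # deque(maxlen=...) evicts the oldest cached frame once the window is full.
        self.cache = deque(maxlen=window)

    def step(self, query, key, value):
        # Cache the current frame first so it can attend to itself. The cache
        # only ever holds current and past frames, so attention is causal
        # by construction and no explicit mask is needed.
        self.cache.append((key, value))
        scale = 1.0 / math.sqrt(len(query))
        scores = [
            scale * sum(q * k for q, k in zip(query, cached_key))
            for cached_key, _ in self.cache
        ]
        # Numerically stable softmax over the cached frames.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the cached value vectors.
        out = [
            sum(w * cached_value[i] for w, (_, cached_value) in zip(weights, self.cache))
            for i in range(len(value))
        ]
        return out, weights


# Usage: feed frames one at a time; only the last `window` frames are kept.
attn = StreamingAttention(window=3)
for t in range(5):
    feat = [float(t), 1.0]  # toy per-frame feature (assumption)
    out, weights = attn.step(feat, feat, feat)
```

In this sketch the per-frame cost depends on the window size rather than on the full sequence length, which is the property a sliding-window cache is meant to provide; keys and values for past frames are computed once and reused rather than recomputed for every new frame.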