End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which learn language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. Because vehicles operate in a 3D world, we argue that dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi-frame inputs and therefore cannot be applied to online planning. To address this, we introduce a streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs online and jointly outputs dense geometry and a planned trajectory for the current frame. We employ temporal causal attention and cache historical features to support on-the-fly inference. To further improve efficiency, we propose a sliding-window streaming strategy that uses only the historical caches within a certain interval, avoiding repeated computation. Despite its faster inference, DVGT-2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT-2 can be applied directly, without fine-tuning, to planning across diverse camera configurations, including the closed-loop NAVSIM and open-loop nuScenes benchmarks.
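To make the streaming idea concrete, the following is a minimal sketch of causal attention over a sliding-window feature cache. It is not the DVGT-2 implementation: the names `SlidingWindowCache` and `streaming_attention_step`, the single-head formulation without learned projections, and the window size are all illustrative assumptions.

```python
import math
from collections import deque


class SlidingWindowCache:
    """Hypothetical cache holding key/value features of the most recent frames."""

    def __init__(self, window):
        # deque(maxlen=...) silently drops the oldest frame once the window is full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def softmax(scores):
    # Subtract the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def streaming_attention_step(cache, query, key, value):
    """Attend from the current frame to the cached frames plus itself.

    Causality holds by construction: the cache only ever contains the
    current frame and frames that arrived before it.
    """
    cache.append(key, value)
    scale = 1.0 / math.sqrt(len(query))
    scores = [
        scale * sum(q * k for q, k in zip(query, past_key))
        for past_key in cache.keys
    ]
    weights = softmax(scores)
    output = [
        sum(w * past_value[i] for w, past_value in zip(weights, cache.values))
        for i in range(len(value))
    ]
    return output, weights


if __name__ == "__main__":
    cache = SlidingWindowCache(window=3)  # window size chosen arbitrarily
    for t in range(5):
        feature = [float(t), 1.0]  # stand-in for a per-frame feature vector
        _, weights = streaming_attention_step(cache, feature, feature, feature)
        print(f"frame {t}: attends to {len(weights)} frame(s)")
```

Each call processes one incoming frame and reuses the cached features of earlier frames, so the per-frame cost is bounded by the window size rather than growing with the length of the sequence.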