End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which learn language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. Because vehicles operate in a 3D world, we argue that dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi-frame inputs and therefore cannot be applied to online planning. To address this, we introduce a streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs online and jointly outputs dense geometry and a planned trajectory for the current frame. We employ temporal causal attention and cache historical features to support on-the-fly inference. To further improve efficiency, we propose a sliding-window streaming strategy that reuses only the historical caches within a limited interval, avoiding repeated computation. Despite its faster inference, DVGT-2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT-2 can be applied directly, without fine-tuning, to planning under diverse camera configurations, including the closed-loop NAVSIM and open-loop nuScenes benchmarks.
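To make the streaming idea concrete, the following is a minimal toy sketch of causal attention over a sliding-window feature cache. It is not the DVGT-2 implementation: the names `SlidingWindowCache`, `attend`, and `stream` are hypothetical, each frame is reduced to a single feature vector, and the same vector is used as query, key, and value purely for illustration.

```python
import math
from collections import deque


class SlidingWindowCache:
    """Hypothetical cache holding features of the most recent `window` frames."""

    def __init__(self, window):
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def attend(query, keys, values):
    """Scaled dot-product attention of one query over the cached keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def stream(frames, window):
    """Process frames one at a time.

    Each frame attends only to itself and to cached earlier frames, so the
    attention is causal by construction; the cache bounds the per-frame cost.
    """
    cache = SlidingWindowCache(window)
    outputs = []
    for feat in frames:
        # Toy assumption: the frame feature acts as query, key, and value.
        cache.append(feat, feat)
        outputs.append(attend(feat, list(cache.keys), list(cache.values)))
    return outputs


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for step, out in enumerate(stream(frames, window=2)):
        print(step, out)
```

In this sketch the window counts the current frame, so `window=2` means each frame sees itself and one cached predecessor. Features of older frames are never recomputed and are dropped once they leave the window, which is the property the sliding-window strategy relies on to keep per-frame cost bounded.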