Vision-Language-Action (VLA) driving augments end-to-end (E2E) planning with language-enabled backbones, yet it remains unclear what changes beyond the usual accuracy--cost trade-off. We revisit this question with a three-RQ analysis in RecogDrive, instantiating the system with a full VLM backbone and a vision-only backbone, both under an identical diffusion Transformer planner. RQ1: at the backbone level, the VLM introduces additional subspaces beyond those of the vision-only backbone. RQ2: these unique subspaces lead to different behavior in some long-tail scenarios: the VLM tends to be more aggressive whereas the ViT is more conservative, and each decisively wins on about 2--3% of test scenarios; with an oracle that selects, per scenario, the better trajectory between the VLM and ViT branches, we obtain an upper bound of 93.58 PDMS. RQ3: to harness this observation, we propose HybridDriveVLA, which runs both the ViT and VLM branches and selects between their endpoint trajectories with a learned scorer, improving PDMS to 92.10. Finally, DualDriveVLA implements a practical fast--slow policy: it runs the ViT by default and invokes the VLM only when the scorer's confidence falls below a threshold; calling the VLM on 15% of scenarios achieves 91.00 PDMS while improving throughput by 3.2x. Code will be released.
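The fast--slow policy described above can be sketched as a confidence-gated selection rule. This is a minimal illustration, not the released implementation: `dual_drive_policy`, `StubScorer`, and the branch callables are hypothetical names, and the threshold value is a placeholder (the paper only states that the scorer's confidence is compared against a threshold).

```python
from dataclasses import dataclass, field

@dataclass
class StubScorer:
    """Toy stand-in for the learned scorer: confidence and trajectory
    scores are read from precomputed lookup tables keyed by scenario."""
    conf_table: dict = field(default_factory=dict)
    score_table: dict = field(default_factory=dict)

    def confidence(self, scenario, traj):
        return self.conf_table[scenario]

    def score(self, scenario, traj):
        return self.score_table[(scenario, traj)]

def dual_drive_policy(scenario, vit_branch, vlm_branch, scorer, tau=0.8):
    """Run the fast ViT branch by default; invoke the slow VLM branch
    only when the scorer's confidence falls below the threshold tau."""
    traj_fast = vit_branch(scenario)                   # cheap vision-only plan
    if scorer.confidence(scenario, traj_fast) >= tau:
        return traj_fast                               # fast path
    traj_slow = vlm_branch(scenario)                   # expensive VLM plan
    # Keep whichever endpoint trajectory the scorer rates higher.
    if scorer.score(scenario, traj_slow) > scorer.score(scenario, traj_fast):
        return traj_slow
    return traj_fast
```

In this sketch the scorer plays both roles from the paper: its confidence gates the VLM call (the DualDriveVLA policy), and its score arbitrates between the two endpoint trajectories when both branches run (the HybridDriveVLA selection).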