Vision-language-action (VLA) models generate chain-of-thought (CoT) reasoning alongside driving trajectories, but existing benchmarks evaluate only trajectory quality and do not assess whether the CoT is relevant, consistent, or causally connected to the driving action. We introduce VLADriveBench, a framework that combines observational metrics (mentioning, hallucination, contradiction, action alignment) with a CoT intervention protocol to provide complementary views of the CoT-action relationship. Applying VLADriveBench to three models across two architectures, we find that the two analyses can diverge sharply: ORION scores highest on observational alignment, yet its CoT is epiphenomenal, whereas Alpamayo v1.5 scores lower, yet its CoT is strongly causal, with visual salience gating the extent of CoT influence.