Counterfactual VLA: Self-Reflective Vision-Language-Action Model with Adaptive Reasoning

Zhenghao "Mark" Peng, Wenhao Ding, Yurong You, Yuxiao Chen, Wenjie Luo, Thomas Tian, Yulong Cao, Apoorva Sharma, Danfei Xu, Boris Ivanovic, Boyi Li, Bolei Zhou, Yan Wang, Marco Pavone

Recent reasoning-augmented Vision-Language-Action (VLA) models have improved the interpretability of end-to-end autonomous driving by generating intermediate reasoning traces. Yet these models primarily describe what they perceive and intend to do, rarely questioning whether their planned actions are safe or appropriate. This work introduces Counterfactual VLA (CF-VLA), a self-reflective VLA framework that enables the model to reason about and revise its planned actions before execution. CF-VLA first generates time-segmented meta-actions that summarize driving intent, and then performs counterfactual reasoning conditioned on both the meta-actions and the visual context. This step simulates potential outcomes, identifies unsafe behaviors, and outputs corrected meta-actions that guide the final trajectory generation. To efficiently obtain such self-reflective capabilities, we propose a rollout-filter-label pipeline that mines high-value scenes from a base (non-counterfactual) VLA's rollouts and labels counterfactual reasoning traces for subsequent training rounds. Experiments on large-scale driving datasets show that CF-VLA improves trajectory accuracy by up to 17.6%, enhances safety metrics by 20.5%, and exhibits adaptive thinking: it enables counterfactual reasoning only in challenging scenarios. By transforming reasoning traces from one-shot descriptions to causal self-correction signals, CF-VLA takes a step toward self-reflective autonomous driving agents that learn to think before they act.
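
The inference flow described in the abstract (draft meta-actions, counterfactual check, corrected meta-actions, trajectory) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract describes a single VLA model, whereas the sketch splits the stages into separate callables purely for readability. Every name here (`Reflection`, `plan_with_self_reflection`, `propose`, `reflect`, `decode`, and the toy stubs) is hypothetical, and the toy scene format and speed values are invented for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types: one short intent label per time segment, and 2D waypoints.
MetaActions = List[str]
Trajectory = List[Tuple[float, float]]
Scene = Dict[str, bool]


@dataclass
class Reflection:
    """Outcome of the (assumed) counterfactual reasoning step."""
    needed: bool                      # adaptive thinking: was a revision warranted?
    reasoning: str                    # free-text counterfactual trace
    corrected: Optional[MetaActions]  # revised meta-actions, if any


def plan_with_self_reflection(
    scene: Scene,
    propose: Callable[[Scene], MetaActions],
    reflect: Callable[[Scene, MetaActions], Reflection],
    decode: Callable[[Scene, MetaActions], Trajectory],
) -> Tuple[MetaActions, Trajectory]:
    """Draft meta-actions, optionally revise them, then decode a trajectory."""
    draft = propose(scene)
    reflection = reflect(scene, draft)
    # Use the corrected meta-actions only when reflection flagged a problem.
    if reflection.needed and reflection.corrected is not None:
        final = reflection.corrected
    else:
        final = draft
    return final, decode(scene, final)


# Toy stand-ins for the model's stages (illustrative assumptions only).
def toy_propose(scene: Scene) -> MetaActions:
    return ["keep speed", "keep speed"]


def toy_reflect(scene: Scene, draft: MetaActions) -> Reflection:
    if scene.get("pedestrian_ahead", False):
        return Reflection(True, "Keeping speed would reach the pedestrian.",
                          ["decelerate", "stop"])
    return Reflection(False, "", None)


def toy_decode(scene: Scene, meta: MetaActions) -> Trajectory:
    # Made-up displacement per time segment for each meta-action.
    step = {"keep speed": 5.0, "decelerate": 2.0, "stop": 0.0}
    x, waypoints = 0.0, []
    for action in meta:
        x += step[action]
        waypoints.append((x, 0.0))
    return waypoints


if __name__ == "__main__":
    for scene in ({"pedestrian_ahead": True}, {"pedestrian_ahead": False}):
        print(scene, plan_with_self_reflection(scene, toy_propose,
                                               toy_reflect, toy_decode))
```

In the benign scene the draft meta-actions pass through unchanged, which mirrors the adaptive behavior the abstract reports; in the toy hazardous scene the corrected meta-actions replace the draft before trajectory decoding.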
