DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

Does Chain-of-Thought (CoT) reasoning genuinely improve Vision-Language-Action (VLA) models, or does it merely add overhead? Existing CoT-VLA systems report limited and inconsistent gains, yet no prior work has rigorously diagnosed when and why CoT helps robots act. Through systematic experiments, we identify two necessary conditions that must be jointly satisfied for CoT to be effective in VLA. (1) Decoding Alignment: CoT and actions must be generated with modality-appropriate mechanisms; forcing both through a single autoregressive decoder is not merely suboptimal but actively harmful, degrading performance by 4.2 percentage points (pp). (2) Causal Alignment: CoT must be causally linked to task success via outcome-based optimization; without it, supervised CoT is indistinguishable from no reasoning at all under distribution shift, exhibiting a 32.0 pp performance drop nearly identical to the 31.6 pp drop of a reasoning-free baseline. Guided by these findings, we build DeepThinkVLA: a hybrid-attention decoder satisfies Condition 1 by pairing causal attention for language with bidirectional attention for parallel action decoding, while a two-stage SFT-then-RL pipeline satisfies Condition 2 by aligning the full reasoning-action chain with sparse task-success rewards. DeepThinkVLA achieves 97.0% success on LIBERO, 79.0% robustness on LIBERO-Plus (vs. 61.6% for π₀-FAST), and 59.3% success on RoboTwin 2.0, exceeding the strongest baseline by 21.7 points. Furthermore, we validate the practical effectiveness of our approach through real-world robot experiments. Code is available at https://github.com/OpenBMB/DeepThinkVLA
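
To make the hybrid-attention idea concrete, the following is a minimal sketch of an attention mask that is causal over the reasoning (language) tokens and bidirectional over the action tokens. It is an illustration under stated assumptions, not the DeepThinkVLA implementation: the function names `hybrid_attention_mask` and `render`, the token layout (all reasoning tokens first, then one action chunk), and the choice to let every action token see the full reasoning prefix are hypothetical and are not taken from the abstract.

```python
def hybrid_attention_mask(num_text: int, num_action: int) -> list[list[bool]]:
    """Return mask[i][j] = True when query position i may attend to key position j.

    Assumed layout (hypothetical): num_text reasoning tokens followed by
    num_action action tokens in a single sequence.
    """
    total = num_text + num_action
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_text:
                # Reasoning/language token: causal attention, so it sees
                # only itself and earlier positions.
                mask[i][j] = j <= i
            else:
                # Action token: sees the whole reasoning prefix and every
                # action token, so the action chunk can be decoded in parallel.
                mask[i][j] = True
    return mask


def render(mask: list[list[bool]]) -> str:
    """Draw the mask with 'x' for allowed attention and '.' for blocked."""
    return "\n".join(
        "".join("x" if allowed else "." for allowed in row) for row in mask
    )


# Three reasoning tokens followed by two action tokens.
print(render(hybrid_attention_mask(3, 2)))
```

In the printed grid, the first three rows form a lower triangle (causal decoding of the reasoning), while the last two rows are fully filled: each action token attends to all reasoning tokens and to both action tokens, which is what permits decoding the action chunk in one parallel step rather than token by token.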
