Reasoning Vision-Language-Action (VLA) models improve robotic instruction-following by generating step-by-step textual plans before low-level actions, an approach inspired by Chain-of-Thought (CoT) reasoning in language models. Yet even with a correct textual plan, the generated actions can still fail to achieve the plan's intended outcomes, especially in out-of-distribution (OOD) scenarios. We formalize this phenomenon as a lack of embodied CoT faithfulness, and introduce a training-free, runtime policy-steering method for reasoning-action alignment. Given a reasoning VLA's intermediate textual plan, our framework samples multiple candidate action sequences from the same model, predicts their outcomes via simulation, and uses a pre-trained Vision-Language Model (VLM) to select the sequence whose outcome best aligns with the VLA's own textual plan. By executing only action sequences that align with the textual reasoning, our method turns the base VLA's natural action diversity from a source of error into a strength, boosting robustness to semantic and visual OOD perturbations and enabling novel behavior composition without costly re-training. We also contribute a reasoning-annotated extension of LIBERO-100 and environment variations tailored for OOD evaluation, and demonstrate up to a 15% performance gain over prior work on behavior composition tasks, with performance that scales with compute and data diversity. Project website: https://yilin-wu98.github.io/steering-reasoning-vla/
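The selection loop described above (sample candidates, simulate outcomes, score alignment with a VLM, execute the best-aligned sequence) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `sample_action_sequences`, `simulate_outcome`, and `vlm_alignment_score` are hypothetical stand-ins, stubbed here with toy logic in place of the real VLA, simulator, and pre-trained VLM.

```python
import random

def sample_action_sequences(vla_policy, textual_plan, n_candidates, seed=0):
    """Stub: the real VLA would sample diverse candidate action sequences
    conditioned on the observation and its own textual plan."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(n_candidates)]

def simulate_outcome(action_sequence):
    """Stub: a simulator would roll out the actions and return a
    predicted outcome (e.g., a future observation). Here, a scalar."""
    return sum(action_sequence)

def vlm_alignment_score(predicted_outcome, textual_plan):
    """Stub: a pre-trained VLM would score how well the predicted outcome
    matches the textual plan. Higher is better-aligned."""
    return -abs(predicted_outcome)

def steer(vla_policy, textual_plan, n_candidates=8):
    """Runtime policy steering: execute only the candidate whose simulated
    outcome the VLM judges most consistent with the VLA's own plan."""
    candidates = sample_action_sequences(vla_policy, textual_plan, n_candidates)
    scored = [
        (vlm_alignment_score(simulate_outcome(a), textual_plan), a)
        for a in candidates
    ]
    best_score, best_actions = max(scored, key=lambda pair: pair[0])
    return best_actions
```

Because the procedure only re-ranks samples from the base policy at inference time, it requires no re-training and its benefit grows with the number of candidates drawn, consistent with the compute-scaling claim above.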