Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

We propose GTA-VLA (Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by letting users guide robot policies with explicit visual cues. Existing VLA models learn a direct "Sense-to-Act" mapping from multimodal observations to robot actions. While effective within the training distribution, such tightly coupled policies are brittle under out-of-domain (OOD) shifts and difficult to correct when failures occur. Recent embodied Chain-of-Thought (CoT) approaches expose intermediate reasoning, but they still lack a mechanism for incorporating human spatial guidance, which limits their ability to resolve visual ambiguities or recover from mistakes. To address this gap, our framework lets users optionally guide the policy with spatial priors, such as affordance points, boxes, and traces, on which the subsequent reasoning process can directly condition. From these inputs, the model generates a unified spatial-visual Chain-of-Thought that integrates external guidance with internal task planning, aligning human visual intent with autonomous decision-making. For practical deployment, we further couple the reasoning module with a lightweight reactive action head for efficient action execution. On the in-domain SimplerEnv WidowX benchmark, our framework achieves a state-of-the-art 81.2% success rate. Under OOD visual shifts and spatial ambiguities, a single visual interaction substantially improves task success over existing methods, highlighting the value of interactive reasoning for failure recovery in embodied control. Project page: https://signalispupupu.github.io/GTA-VLA_ProjPage/
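
To make the idea of optional spatial guidance concrete, the sketch below shows one way that user-supplied points, boxes, and traces could be serialized into the text a reasoning module conditions on. It is an illustration only and not the authors' implementation: the `SpatialGuidance` class, the `build_prompt` function, the pixel-coordinate convention, and the `<point>`, `<box>`, and `<trace>` tags are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[int, int]          # (x, y) in pixel coordinates (assumed convention)
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates (assumed)


@dataclass
class SpatialGuidance:
    """Optional user-provided visual cues. All fields are hypothetical."""
    points: List[Point] = field(default_factory=list)
    boxes: List[Box] = field(default_factory=list)
    trace: List[Point] = field(default_factory=list)

    def is_empty(self) -> bool:
        return not (self.points or self.boxes or self.trace)


def build_prompt(instruction: str, guidance: Optional[SpatialGuidance] = None) -> str:
    """Assemble the text prefix for the reasoning step.

    Without guidance the prompt contains only the task, so the policy
    runs autonomously; with guidance, the cues are appended so that the
    generated reasoning can condition on them.
    """
    lines = [f"Task: {instruction}"]
    if guidance is not None and not guidance.is_empty():
        for x, y in guidance.points:
            lines.append(f"<point>{x},{y}</point>")
        for x0, y0, x1, y1 in guidance.boxes:
            lines.append(f"<box>{x0},{y0},{x1},{y1}</box>")
        if guidance.trace:
            coords = " ".join(f"{x},{y}" for x, y in guidance.trace)
            lines.append(f"<trace>{coords}</trace>")
    lines.append("Reasoning:")
    return "\n".join(lines)
```

Calling `build_prompt("pick up the spoon")` yields a guidance-free prompt, while passing `SpatialGuidance(points=[(120, 88)])` adds a single point cue, mirroring the "single visual interaction" setting described above.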
