Continuous Reasoning for Vision-Language-Action

Natural language is a powerful reasoning medium for language and vision-language models, but it is mismatched to the granularity of continuous control. Text and explicit subgoals operate at task-level granularity, whereas vision-language-action (VLA) policies must choose actions at a much finer temporal scale; a single reasoning step can therefore span many action chunks while remaining only weakly coupled to the action needed now. This suggests a different question for VLA: what should play the role of language? We argue that a useful VLA reasoning medium must be shareable across model instances, verifiable through downstream action improvement, and aligned with temporally extended control structure. Based on this view, we propose Continuous Reasoning for Vision-Language-Action. Our model first predicts continuous reasoning in the form of a structured set of continuous thoughts, then reuses them as shared context for chunk-structured action generation. Better action prediction alone does not certify good reasoning: if the same internal medium cannot be shared across model instances and independently verified through improved downstream control, the added latent may simply become a model-private shortcut that helps on seen behaviors without supporting generalizable control. We therefore instantiate continuous reasoning as a shared Gaussian latent interface and train it with a self-verification objective in which an exponential-moving-average teacher must successfully consume the student's reasoning when predicting target actions. Empirically, Continuous Reasoning improves LIBERO-PRO robustness and performs strongly on real robots, raising mean subtask success over π0.5 by 40.4% on TX-G2, an AgiBot G2-compatible variant, and 26.3% on HSR. This suggests that reasoning in VLA is less about extra tokens than about a shareable, verifiable internal language for action.
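The training signal described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear heads, the parameter names (`w_mu`, `w_logstd`, `w_act`), the mean-squared-error loss, and the decay value are all hypothetical choices made for brevity. It shows only the three ingredients the abstract names: a Gaussian latent sampled by the student, an exponential-moving-average teacher, and a loss in which the teacher must predict the target action chunk from the student's thoughts.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_params(obs_dim, latent_dim, action_dim, chunk_len):
    """Hypothetical linear model: a reasoning head and an action head."""
    return {
        "w_mu": rng.normal(0.0, 0.1, (obs_dim, latent_dim)),
        "w_logstd": np.zeros((obs_dim, latent_dim)),
        # The action head emits a whole chunk of actions at once (flattened).
        "w_act": rng.normal(0.0, 0.1, (latent_dim, chunk_len * action_dim)),
    }


def predict_thoughts(params, obs, noise):
    """Sample continuous thoughts from a Gaussian via reparameterization."""
    mu = obs @ params["w_mu"]
    std = np.exp(obs @ params["w_logstd"])
    return mu + std * noise


def predict_actions(params, thoughts):
    """Decode a flattened action chunk from the shared thoughts."""
    return thoughts @ params["w_act"]


def ema_update(teacher, student, decay=0.99):
    """Move each teacher parameter a small step toward the student."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}


def self_verification_loss(student, teacher, obs, target_actions, noise):
    """The teacher consumes the student's thoughts to predict target actions.

    Assumption: mean squared error stands in for whatever action loss
    the actual method uses.
    """
    thoughts = predict_thoughts(student, obs, noise)
    teacher_actions = predict_actions(teacher, thoughts)
    return float(np.mean((teacher_actions - target_actions) ** 2))
```

In a real training loop the loss would be differentiated with respect to the student's parameters only, with `ema_update` applied to the teacher after each optimizer step. The point of the design is that the thoughts are useful only if a second model instance can act on them, which discourages a model-private shortcut.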
