Vision-Language-Action (VLA) models have recently shown strong generalization, with some approaches seeking to explicitly generate linguistic reasoning traces or predict future observations prior to execution. However, explicit reasoning typically incurs non-negligible inference latency, which constrains the temporal resolution required for robotic manipulation. Moreover, such reasoning is confined to the linguistic space, imposing a representational bottleneck that struggles to faithfully capture ineffable physical attributes. To mitigate these limitations, we propose LaST$_0$, a framework that enables efficient reasoning before acting through a Latent Spatio-Temporal Chain-of-Thought (CoT), capturing fine-grained physical and robotic dynamics that are often difficult to verbalize. Specifically, we introduce a token-efficient latent CoT space that models future visual dynamics, 3D structural information, and robot proprioceptive states, and we further extend these representations across time to enable temporally consistent implicit reasoning trajectories. Furthermore, LaST$_0$ adopts a dual-system architecture implemented via a Mixture-of-Transformers design, where a reasoning expert conducts low-frequency latent inference and an acting expert generates high-frequency actions conditioned on robotics-oriented latent representations. To facilitate coordination, LaST$_0$ is trained with heterogeneous operation frequencies, enabling adaptive switching during deployment. Across 10 real-world tasks spanning tabletop, mobile, and dexterous hand manipulation, LaST$_0$ improves mean success rates by 13%, 14%, and 14% over prior SOTA VLA methods, respectively.
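The dual-system design described above can be sketched as a control loop in which a low-frequency reasoning expert periodically refreshes a latent spatio-temporal CoT, while a high-frequency acting expert emits an action at every step conditioned on the most recent latent. The sketch below is purely illustrative; all names (`reason_expert`, `act_expert`, `REASON_PERIOD`) and the toy computations are assumptions, not the paper's actual API or frequency ratio.

```python
from typing import List

REASON_PERIOD = 5  # assumed ratio: acting runs 5x more often than reasoning


def reason_expert(obs: List[float]) -> List[float]:
    # Stand-in for low-frequency latent CoT inference over future visual
    # dynamics, 3D structure, and proprioception; here a toy transform.
    return [2.0 * x for x in obs]


def act_expert(obs: List[float], latent: List[float]) -> List[float]:
    # Stand-in for high-frequency action generation conditioned on the
    # latest robotics-oriented latent representation.
    return [o + z for o, z in zip(obs, latent)]


def control_loop(observations: List[List[float]]) -> List[List[float]]:
    """Run acting at every step; refresh the latent only every REASON_PERIOD."""
    latent: List[float] = []
    actions: List[List[float]] = []
    for t, obs in enumerate(observations):
        if t % REASON_PERIOD == 0:
            latent = reason_expert(obs)       # low-frequency latent reasoning
        actions.append(act_expert(obs, latent))  # high-frequency acting
    return actions
```

The key design point this illustrates is that acting never blocks on reasoning: between refreshes, the acting expert reuses the stale-but-recent latent, which is what lets explicit per-step reasoning latency be avoided.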