LaST$_{0}$: Latent Spatio-Temporal Chain-of-Thought for Robotic Vision-Language-Action Model

Zhuoyang Liu, Jiaming Liu, Hao Chen, Jiale Yu, Ziyu Guo, Chengkai Hou, Chenyang Gu, Xiangju Mi, Renrui Zhang, Kun Wu, Zhengping Che, Jian Tang, Pheng-Ann Heng, Shanghang Zhang

Source: arXiv. Project page: https://vla-last0.github.io/

Vision-Language-Action (VLA) models have recently shown strong generalization, with some approaches seeking to explicitly generate linguistic reasoning traces or predict future observations prior to execution. However, explicit reasoning typically incurs non-negligible inference latency, which constrains the temporal resolution required for robotic manipulation. Moreover, such reasoning is confined to the linguistic space, imposing a representational bottleneck that struggles to faithfully capture ineffable physical attributes. To mitigate these limitations, we propose LaST$_0$, a framework that enables efficient reasoning before acting through a Latent Spatio-Temporal Chain-of-Thought (CoT), capturing fine-grained physical and robotic dynamics that are often difficult to verbalize. Specifically, we introduce a token-efficient latent CoT space that models future visual dynamics, 3D structural information, and robot proprioceptive states, and further extends these representations across time to enable temporally consistent implicit reasoning trajectories. Furthermore, LaST$_0$ adopts a dual-system architecture implemented via a Mixture-of-Transformers design, where a reasoning expert conducts low-frequency latent inference and an acting expert generates high-frequency actions conditioned on robotics-oriented latent representations. To facilitate coordination, LaST$_0$ is trained with heterogeneous operation frequencies, enabling adaptive switching during deployment. Across 10 real-world tasks spanning tabletop, mobile, and dexterous hand manipulation, LaST$_0$ improves mean success rates by 13%, 14% and 14% over prior SOTA VLA methods, respectively.
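
The dual-rate scheduling described in the abstract (a reasoning expert that runs at low frequency, an acting expert that runs at every control step on the most recent latent) can be illustrated with a minimal sketch. This is not the authors' implementation: `DualRateController`, `toy_reasoning_expert`, `toy_acting_expert`, and the `reason_every` parameter are hypothetical names, and the two "experts" are trivial arithmetic stand-ins rather than Mixture-of-Transformers components. The sketch only shows how a cached latent could be reused between infrequent reasoning updates.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

Observation = Sequence[float]
Latent = List[float]
Action = List[float]


def toy_reasoning_expert(obs: Observation) -> Latent:
    """Hypothetical slow expert: compress an observation into a short latent."""
    mean = sum(obs) / len(obs)
    return [mean, max(obs) - min(obs)]


def toy_acting_expert(obs: Observation, latent: Latent) -> Action:
    """Hypothetical fast expert: act on the current observation and cached latent."""
    return [latent[0] - obs[0]]


@dataclass
class DualRateController:
    """Runs `reason` once every `reason_every` steps and `act` at every step."""

    reason: Callable[[Observation], Latent]
    act: Callable[[Observation, Latent], Action]
    reason_every: int = 4  # assumed ratio between acting and reasoning rates
    _latent: Latent = field(default_factory=list)
    _step: int = 0
    reasoning_calls: int = 0

    def step(self, obs: Observation) -> Action:
        # Refresh the latent only on the low-frequency schedule.
        if self._step % self.reason_every == 0:
            self._latent = self.reason(obs)
            self.reasoning_calls += 1
        self._step += 1
        # The acting expert always runs, conditioned on the cached latent.
        return self.act(obs, self._latent)


controller = DualRateController(toy_reasoning_expert, toy_acting_expert, reason_every=4)
observations = [[float(t), float(t) + 1.0] for t in range(8)]
actions = [controller.step(obs) for obs in observations]
print(controller.reasoning_calls, actions)
```

In this toy setup, eight control steps trigger only two reasoning calls. Because `reason_every` is an ordinary attribute, it could be changed between steps, which loosely mirrors the idea of switching operating frequencies at deployment; how LaST$_0$ actually performs that switching is not specified in the abstract.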
