LaST-R1: Reinforcing Action via Adaptive Physical Latent Reasoning for VLA Models

Hao Chen, Jiaming Liu, Zhonghao Yan, Nuowei Han, Renrui Zhang, Chenyang Gu, Jialin Gao, Ziyu Guo, Siyuan Qian, Yinxi Wang, Peng Jia, Chi-Wing Fu, Shanghang Zhang, Pheng-Ann Heng

Vision-Language-Action (VLA) models have increasingly incorporated reasoning mechanisms for complex robotic manipulation. However, existing approaches share a critical limitation: whether employing explicit linguistic reasoning that suffers from latency and discretization, or utilizing more expressive continuous latent reasoning, they are predominantly confined to static imitation learning that limits adaptability and generalization. While online reinforcement learning (RL) has been introduced to VLAs to enable trial-and-error exploration, current methods exclusively optimize the vanilla action space, bypassing the underlying physical reasoning process. In this paper, we present \textbf{LaST-R1}, a unified VLA framework that integrates latent Chain-of-Thought (CoT) reasoning over physical dynamics prior to action execution, along with a tailored RL post-training paradigm. Specifically, we propose \textbf{Latent-to-Action Policy Optimization (LAPO)}, a novel RL algorithm that jointly optimizes the latent reasoning process and the action generation. By bridging reasoning and control, LAPO improves the representation of physical world modeling and enhances robustness in interactive environments. Furthermore, an \textbf{adaptive latent CoT mechanism} is introduced to allow the policy to dynamically adjust its reasoning horizon based on environment complexity. Extensive experiments show that LaST-R1 achieves a near-perfect 99.8\% average success rate on the LIBERO benchmark with only one-shot supervised warm-up, significantly improving convergence speed and performance over prior state-of-the-art methods. In real-world deployments, LAPO post-training yields up to a 44\% improvement over the initial warm-up policy across four complex tasks, including both single-arm and dual-arm settings. Finally, LaST-R1 demonstrates strong generalization across simulated and real-world environments.
