TwinRL-VLA: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation

Qinwen Xu, Jiaming Liu, Rui Zhou, Shaojun Shi, Nuowei Han, Zhuoyang Liu, Chenyang Gu, Shuo Gu, Yang Yue, Gao Huang, Wenzhao Zheng, Sirui Han, Peng Jia, Shanghang Zhang

Despite strong generalization capabilities, Vision-Language-Action (VLA) models remain constrained by the high cost of expert demonstrations and insufficient real-world interaction. While online reinforcement learning (RL) has shown promise in improving general foundation models, applying RL to VLA manipulation in real-world settings is still hindered by low exploration efficiency and a restricted exploration space. Through systematic real-world experiments, we observe that the effective exploration space of online RL is closely tied to the data distribution of supervised fine-tuning (SFT). Motivated by this observation, we propose TwinRL, a collaborative RL framework that couples a digital twin with the real world to scale and guide exploration for VLA models. First, a high-fidelity digital twin is efficiently reconstructed from smartphone-captured scenes, enabling realistic bidirectional transfer between real and simulated environments. During the SFT warm-up stage, we introduce an exploration space expansion strategy that uses the digital twin to broaden the support of the trajectory data distribution. Building on this enhanced initialization, we propose a sim-to-real guided exploration strategy to further accelerate online RL. Specifically, TwinRL performs efficient, parallel online RL in the digital twin prior to deployment, effectively bridging the gap between the offline and online training stages. Subsequently, we exploit efficient digital twin sampling to identify failure-prone yet informative configurations, which are used to guide targeted human-in-the-loop rollouts on the real robot. In our experiments, TwinRL approaches 100% success both in in-distribution regions covered by real-world demonstrations and in out-of-distribution regions, delivering at least a 30% speedup over prior real-world RL methods and requiring only about 20 minutes on average across four tasks.
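
To make the idea of sampling the digital twin for failure-prone configurations concrete, the following is a minimal sketch and not the authors' implementation. It assumes a configuration is a 2D object placement, that a simulated episode returns a success flag, and that configurations are ranked by estimated success rate alone; the abstract does not specify the actual selection criterion or how "informative" is measured. All names (`Config`, `estimate_success_rate`, `select_failure_prone`, `toy_rollout`) are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical: an initial configuration is an (x, y) object placement.
Config = Tuple[float, float]
Rollout = Callable[[Config, random.Random], bool]


def estimate_success_rate(
    rollout: Rollout, config: Config, n_rollouts: int, rng: random.Random
) -> float:
    """Fraction of successful simulated rollouts for one configuration."""
    successes = sum(rollout(config, rng) for _ in range(n_rollouts))
    return successes / n_rollouts


def select_failure_prone(
    rollout: Rollout,
    configs: Sequence[Config],
    n_rollouts: int = 20,
    k: int = 3,
    seed: int = 0,
) -> List[Tuple[Config, float]]:
    """Return the k configurations with the lowest estimated success rate.

    Assumption: ranking by failure rate only; the paper may use a
    different or richer criterion.
    """
    rng = random.Random(seed)
    rates = [
        (c, estimate_success_rate(rollout, c, n_rollouts, rng)) for c in configs
    ]
    rates.sort(key=lambda item: item[1])  # lowest success rate first
    return rates[:k]


def toy_rollout(config: Config, rng: random.Random) -> bool:
    """Stand-in for a digital-twin episode (not a real simulator).

    Success becomes less likely as the object moves away from the origin.
    """
    x, y = config
    p_success = max(0.0, 1.0 - (abs(x) + abs(y)))
    return rng.random() < p_success


if __name__ == "__main__":
    grid = [(x / 4, y / 4) for x in range(-2, 3) for y in range(-2, 3)]
    for config, rate in select_failure_prone(toy_rollout, grid):
        print(config, rate)
```

In a setup like the one described, the selected configurations would then be handed to the real robot as starting points for human-in-the-loop rollouts; here `toy_rollout` merely stands in for the policy running in the digital twin.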
