TwinRL-VLA: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation

Qinwen Xu, Jiaming Liu, Rui Zhou, Shaojun Shi, Nuowei Han, Zhuoyang Liu, Chenyang Gu, Shuo Gu, Yang Yue, Gao Huang, Wenzhao Zheng, Sirui Han, Peng Jia, Shanghang Zhang

Despite their strong generalization capabilities, Vision-Language-Action (VLA) models remain constrained by the high cost of expert demonstrations and by insufficient real-world interaction. While online reinforcement learning (RL) has shown promise in improving general foundation models, applying RL to VLA manipulation in real-world settings is still hindered by low exploration efficiency and a restricted exploration space. Through systematic real-world experiments, we observe that the effective exploration space of online RL is closely tied to the data distribution used for supervised fine-tuning (SFT). Motivated by this observation, we propose TwinRL, a collaborative RL framework that couples a digital twin with the real world to scale and guide exploration for VLA models. First, a high-fidelity digital twin is efficiently reconstructed from smartphone-captured scenes, enabling realistic bidirectional transfer between real and simulated environments. During the SFT warm-up stage, we introduce an exploration space expansion strategy that uses the digital twin to broaden the support of the trajectory data distribution. Building on this enhanced initialization, we propose a sim-to-real guided exploration strategy to further accelerate online RL. Specifically, TwinRL performs efficient, parallel online RL in the digital twin prior to deployment, bridging the gap between the offline and online training stages. We then use efficient digital twin sampling to identify failure-prone yet informative configurations, which guide targeted human-in-the-loop rollouts on the real robot. In our experiments, TwinRL approaches 100% success in both in-distribution regions covered by real-world demonstrations and out-of-distribution regions, delivering at least a 30% speedup over prior real-world RL methods and requiring only about 20 minutes on average across four tasks.
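
The idea of sampling in the digital twin to find failure-prone configurations can be illustrated with a minimal sketch. This is not the authors' implementation: the `Config` type, the `rollout` callable, the success-rate threshold, and the ranking rule are all hypothetical choices made for illustration. The sketch assumes a simulated rollout that returns success or failure for a given initial configuration, estimates a per-configuration success rate, and keeps the lowest-scoring configurations as candidates for real-robot rollouts.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical configuration type: (x, y) position of the target object.
Config = Tuple[float, float]


def estimate_success_rate(
    rollout: Callable[[Config], bool], config: Config, n_rollouts: int
) -> float:
    """Run the (assumed) simulated rollout n_rollouts times and average successes."""
    successes = sum(1 for _ in range(n_rollouts) if rollout(config))
    return successes / n_rollouts


def select_failure_prone(
    configs: Sequence[Config],
    rollout: Callable[[Config], bool],
    n_rollouts: int = 20,
    max_success: float = 0.5,  # assumed threshold, not taken from the paper
    top_k: int = 3,
) -> List[Config]:
    """Return up to top_k configurations whose estimated success rate is lowest."""
    scored = [(estimate_success_rate(rollout, c, n_rollouts), c) for c in configs]
    hard = [(rate, c) for rate, c in scored if rate <= max_success]
    hard.sort(key=lambda item: item[0])  # stable sort: ties keep input order
    return [c for _, c in hard[:top_k]]


if __name__ == "__main__":
    # Toy stand-in for a digital-twin rollout: the policy fails when x >= 0.5.
    def toy_rollout(config: Config) -> bool:
        return config[0] < 0.5

    candidates = [(0.1, 0.0), (0.9, 0.0), (0.7, 0.2)]
    print(select_failure_prone(candidates, toy_rollout, n_rollouts=4))
```

In practice the rollout would be stochastic and the selection rule might also weigh how informative a configuration is, not only how often it fails; the threshold-and-rank rule here is only one simple possibility.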
