Despite their strong generalization capabilities, Vision-Language-Action (VLA) models remain constrained by the high cost of expert demonstrations and insufficient real-world interaction. While online reinforcement learning (RL) has shown promise for improving general foundation models, applying RL to VLA manipulation in real-world settings is still hindered by low exploration efficiency and a restricted exploration space. Through systematic real-world experiments, we observe that the effective exploration space of online RL is closely tied to the data distribution used for supervised fine-tuning (SFT). Motivated by this observation, we propose TwinRL, a digital-twin-real-world collaborative RL framework designed to scale and guide exploration for VLA models. First, a high-fidelity digital twin is efficiently reconstructed from smartphone-captured scenes, enabling realistic bidirectional transfer between the real and simulated environments. During the SFT warm-up stage, we introduce an exploration-space expansion strategy that uses digital twins to broaden the support of the trajectory distribution. Building on this enhanced initialization, we propose a sim-to-real guided exploration strategy that further accelerates online RL. Specifically, TwinRL performs efficient, parallel online RL in the digital twin prior to deployment, effectively bridging the gap between the offline and online training stages. Subsequently, we exploit efficient digital-twin sampling to identify failure-prone yet informative configurations, which guide targeted human-in-the-loop rollouts on the real robot. In our experiments, TwinRL approaches a 100% success rate both in in-distribution regions covered by real-world demonstrations and in out-of-distribution regions, delivering at least a 30% speedup over prior real-world RL methods while requiring only about 20 minutes on average across four tasks.