Simulation offers a scalable and low-cost way to enrich vision-language-action (VLA) training, reducing reliance on expensive real-robot demonstrations. However, most sim-real co-training methods rely on supervised fine-tuning (SFT), which treats simulation as a static source of demonstrations and does not exploit large-scale closed-loop interaction. Consequently, real-world gains and generalization are often limited. In this paper, we propose an \underline{\textit{RL}}-based sim-real \underline{\textit{Co}}-training (RL-Co) framework that leverages interactive simulation while preserving real-world capabilities. Our method follows a generic two-stage design: we first warm-start the policy with SFT on a mixture of real and simulated demonstrations, then fine-tune it with reinforcement learning in simulation while adding an auxiliary supervised loss on real-world data to anchor the policy and mitigate catastrophic forgetting. We evaluate our framework on four real-world tabletop manipulation tasks using two representative VLA architectures, OpenVLA and $\pi_{0.5}$, and observe consistent improvements over real-only fine-tuning and SFT-based co-training, including +24\% real-world success on OpenVLA and +20\% on $\pi_{0.5}$. Beyond higher success rates, RL co-training yields stronger generalization to unseen task variations and substantially improved real-world data efficiency, providing a practical and scalable pathway for leveraging simulation to enhance real-robot deployment.
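The stage-2 objective described above (an RL loss on simulated rollouts plus an auxiliary supervised anchor on real demonstrations) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy tabular policy, the policy-gradient surrogate, the helper names, and the `anchor_weight` value are all assumptions introduced for clarity.

```python
import math


class ToyPolicy:
    """Minimal stand-in for a VLA policy: a fixed (obs, act) -> probability table.

    A real implementation would be a neural policy whose log-probs are
    differentiable; here we only illustrate how the losses combine.
    """

    def __init__(self, probs):
        self.probs = probs  # {(obs, act): probability of act given obs}

    def log_prob(self, obs, act):
        return math.log(self.probs[(obs, act)])


def sft_loss(policy, demos):
    # Behavior cloning: mean negative log-likelihood of demonstrated actions.
    return -sum(policy.log_prob(o, a) for o, a in demos) / len(demos)


def rl_co_loss(policy, sim_rollouts, real_demos, anchor_weight=0.1):
    # Policy-gradient surrogate on simulated rollouts (advantage-weighted
    # log-probs), plus an auxiliary SFT anchor on real demonstrations that
    # keeps the policy close to real-world behavior (mitigates forgetting).
    pg = -sum(adv * policy.log_prob(o, a)
              for o, a, adv in sim_rollouts) / len(sim_rollouts)
    return pg + anchor_weight * sft_loss(policy, real_demos)


policy = ToyPolicy({("s0", "a0"): 0.8, ("s1", "a1"): 0.5})
loss = rl_co_loss(
    policy,
    sim_rollouts=[("s0", "a0", 1.0), ("s1", "a1", -0.5)],  # (obs, act, advantage)
    real_demos=[("s0", "a0")],
)
```

Minimizing this combined loss pushes up log-probs of high-advantage simulated actions while the anchor term penalizes drift away from the real demonstrations; `anchor_weight` trades off the two.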