Recent advances in vision-language-action (VLA) models have motivated the extension of their capabilities to embodied settings, where reinforcement learning (RL) offers a principled way to optimize task success through interaction. However, existing methods remain fragmented, lacking both a unified platform for fair comparison across architectures and algorithms and an efficient system design for scalable training. To address these challenges, we introduce RLinf-VLA, a unified and efficient framework for scalable RL training of VLA models. RLinf-VLA provides a unified interface that standardizes the integration of diverse VLA architectures, multiple RL algorithms, and heterogeneous simulators, making the framework readily extensible. To ensure efficiency, the system adopts a flexible resource allocation architecture for the rendering, inference, and training workloads in RL pipelines. In particular, for GPU-parallelized simulators, RLinf-VLA introduces a hybrid fine-grained pipeline allocation strategy, yielding a 1.61x-1.88x training speedup. Using this unified system, models trained with RLinf-VLA demonstrate consistent performance improvements of approximately 20-85% across multiple simulation benchmarks, including LIBERO, ManiSkill, and RoboTwin. Furthermore, we distill a set of training practices for effective RL-based VLA training. We position RLinf-VLA as a foundational system that enables efficient, unified, and reproducible research in embodied intelligence.