Recent advances in vision-language-action (VLA) models have motivated the extension of their capabilities to embodied settings, where reinforcement learning (RL) offers a principled way to optimize task success through interaction. However, existing methods remain fragmented, lacking both a unified platform for fair comparison across architectures and algorithms and an efficient system design for scalable training. To address these challenges, we introduce RLinf-VLA, a unified and efficient framework for scalable RL training of VLA models. RLinf-VLA provides a unified interface that standardizes the integration of diverse VLA architectures, multiple RL algorithms, and heterogeneous simulators, making the framework readily extensible. To ensure efficiency, the system adopts a flexible resource allocation architecture for the rendering, inference, and training workloads in RL pipelines. In particular, for GPU-parallelized simulators, RLinf-VLA introduces a hybrid fine-grained pipeline allocation strategy, yielding a 1.61x-1.88x training speedup. Using this unified system, models trained with RLinf-VLA demonstrate consistent performance improvements of approximately 20-85% across multiple simulation benchmarks, including LIBERO, ManiSkill, and RoboTwin. Furthermore, we distill a set of training practices for effective RL-based VLA training. We position RLinf-VLA as a foundational system that enables efficient, unified, and reproducible research in embodied intelligence.