Reinforcement learning (RL) for large-scale Vision-Language-Action (VLA) models is severely bottlenecked by synchronization barriers and the high cost of environment data acquisition. To overcome these challenges, we propose AcceRL, a distributed asynchronous RL framework that physically isolates environment rollouts, model inference, and gradient updates. By eliminating the cascading long-tail idle bubbles inherent in synchronous systems, AcceRL maximizes hardware utilization and ensures scalable throughput. Furthermore, AcceRL features a modular design that supports the integration of diverse, plug-and-play world models into its distributed pipeline. Extensive experiments demonstrate that the base framework achieves highly competitive performance across all four LIBERO~\cite{liu2023libero} task suites. At the system level, the asynchronous architecture delivers a $2.4\times$ throughput speedup over leading synchronous baselines. At the algorithm level, by leveraging a world model pre-trained on 1,000 offline trajectories, AcceRL achieves up to a $200\times$ improvement in online sample efficiency on LIBERO-Spatial, establishing a robust framework that is both sample-efficient and time-efficient for embodied AI. Code is included in the supplementary material and is available at https://github.com/distanceLu/AcceRL.