This paper introduces GEAR, a distributed, GPU-centric experience replay system designed for scalable reinforcement learning (RL) with large sequence models such as transformers. With such models, existing systems such as Reverb face considerable bottlenecks in memory, computation, and communication. GEAR improves memory efficiency by using the memory resources of GPU servers (both host memory and device memory) to manage trajectory data. It also lets decentralized GPU devices accelerate various trajectory selection strategies, circumventing computational bottlenecks. To improve communication efficiency, GEAR provides GPU kernels that collect trajectories using zero-copy access to host memory, along with remote direct memory access (RDMA) over InfiniBand. Cluster experiments show that GEAR achieves up to 6x higher performance than Reverb when training state-of-the-art large RL models. GEAR is open-sourced at https://github.com/bigrl-team/gear.