Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, existing on-policy algorithms used for post-training are inherently incompatible with experience replay buffers, which can be populated scalably by distributed off-policy actors to enhance exploration as compute increases. We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA), a massively scalable LLM RL system. In contrast to existing approaches, TBA devotes a larger fraction of compute to search, constantly generating off-policy data for a central replay buffer. A training node simultaneously samples data from this buffer based on reward or recency to update the policy using Trajectory Balance (TB), a diversity-seeking RL objective introduced for GFlowNets. TBA offers three key advantages: (1) decoupled training and search, cutting training wall-clock time by 4x or more; (2) improved diversity through large-scale off-policy sampling; and (3) scalable search for sparse reward settings. On mathematical reasoning, preference tuning, and automated red-teaming (diverse and representative post-training tasks), TBA produces speed and performance improvements over strong baselines.