Applying Reinforcement Learning (RL) to sequence generation models enables the direct optimization of long-term rewards (\textit{e.g.,} BLEU and human feedback), but typically requires large-scale sampling over a space of action sequences. This is computationally challenging in practical sequence generation problems such as machine translation, where we often deal with a large action space (\textit{e.g.,} a vocabulary) and a long action sequence (\textit{e.g.,} a translation). In this work, we introduce two-stage sampling and dynamic sampling approaches to improve sampling efficiency when training sequence generation models via RL. We evaluate our approaches on traditional sequence generation tasks, including machine translation and abstractive summarization. Furthermore, we evaluate them in RL from human feedback (RLHF) by training a large language model with a reward model. Experimental results show that the resulting efficient sampling-based RL, referred to as ESRL, outperforms all baselines in terms of both training efficiency and memory consumption. Notably, ESRL yields consistent performance gains over the strong REINFORCE, minimum risk training, and proximal policy optimization methods.