We introduce AGRO (Any-Generation Reward Optimization), a novel reinforcement learning algorithm for fine-tuning large language models. AGRO leverages the concept of generation consistency, which states that the optimal policy satisfies a consistency condition across every possible generation of the model. We derive algorithms that find optimal solutions via sample-based policy gradients and provide theoretical guarantees on their convergence. Our experiments in both on-policy and off-policy settings demonstrate the effectiveness of AGRO, showing improved performance over baseline algorithms on a mathematical reasoning dataset.
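To make the "sample-based policy gradient" the abstract refers to concrete, the following is a minimal, hypothetical illustration of the generic machinery: a REINFORCE-style update for a one-step policy over a tiny vocabulary, where sampled generations that earn reward have their log-probability pushed up. This is a toy sketch of standard policy-gradient fine-tuning, not the AGRO algorithm itself; the vocabulary size, reward function, and learning rate are invented for the example.

```python
import numpy as np

# Toy sketch of a sample-based (REINFORCE-style) policy gradient update.
# NOT the AGRO algorithm -- only the generic policy-gradient machinery
# that this line of work builds on. All constants here are illustrative.

rng = np.random.default_rng(0)

VOCAB = 5                  # tiny "vocabulary"
TARGET = 3                 # reward 1.0 if this token is generated
LR = 0.5                   # learning rate (illustrative)

logits = np.zeros(VOCAB)   # policy parameters: one logit per token

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reward(token):
    # Hypothetical reward: 1.0 for the "correct" generation, else 0.0.
    return 1.0 if token == TARGET else 0.0

for _ in range(200):
    probs = softmax(logits)
    token = rng.choice(VOCAB, p=probs)   # sample a generation from the policy
    r = reward(token)
    # Gradient of log pi(token) w.r.t. the logits: one_hot(token) - probs.
    grad_logp = -probs
    grad_logp[token] += 1.0
    logits += LR * r * grad_logp         # REINFORCE ascent step

probs = softmax(logits)
```

After training, the policy's probability mass concentrates on the rewarded token. In practice, LLM fine-tuning applies the same idea over multi-token generations with learned or programmatic rewards, plus variance-reduction terms this sketch omits.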