Reinforcement Learning from Human Feedback (\textbf{RLHF}) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study how well several algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization (\textbf{PPO}), and Return-Conditioned RL) improve LLM reasoning capabilities. We investigate sparse and dense rewards, provided to the LLM both heuristically and via a learned reward model. We additionally start from multiple model sizes and initializations, both with and without supervised fine-tuning (\textbf{SFT}) data. Overall, we find that all algorithms perform comparably, with Expert Iteration performing best in most cases. Surprisingly, we find that the sample complexity of Expert Iteration is similar to that of PPO, requiring at most on the order of $10^6$ samples to converge from a pretrained checkpoint. We investigate why this is the case, concluding that during RL training models fail to explore significantly beyond solutions already produced by SFT models. Additionally, we discuss a trade-off between maj@1 and pass@96 performance during SFT training and how, conversely, RL training improves both simultaneously. We conclude by discussing the implications of our findings for RLHF and the future role of RL in LLM fine-tuning.