We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may be bottlenecked by human performance level; moreover, these separate, frozen reward models cannot learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction-following ability improve, but so does the model's ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While much remains to be explored, this work opens the door to the possibility of models that can continually improve along both axes.
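The abstract describes the training loop only at a high level. The sketch below illustrates one iteration of the self-rewarding scheme as we read it: the model samples candidate responses, scores them with an LLM-as-a-Judge prompt addressed to itself, and turns the best/worst candidates into preference pairs for DPO. The `generate` and `dpo_update` callables, the judge-prompt wording, the 0-5 scale, and `n_candidates=4` are all illustrative assumptions, not the paper's exact prompts or hyperparameters.

```python
# Minimal sketch of one Self-Rewarding iteration, under stated assumptions:
# `generate` (samples completions) and `dpo_update` (one round of Direct
# Preference Optimization) are hypothetical callables standing in for real
# inference/training code; the judge prompt and 0-5 scale are illustrative.
import re
from typing import Callable, Dict, List

JUDGE_TEMPLATE = (
    "Review the user's question and the candidate response, then rate the "
    "response on a scale of 0 to 5 for how well it follows the instruction.\n"
    "Question: {prompt}\nResponse: {response}\nRating:"
)


def judge_score(generate: Callable[[str, int], List[str]],
                prompt: str, response: str) -> int:
    """LLM-as-a-Judge: the same model scores one of its own candidates."""
    verdict = generate(JUDGE_TEMPLATE.format(prompt=prompt, response=response), 1)[0]
    match = re.search(r"[0-5]", verdict)  # parse the first rating digit
    return int(match.group()) if match else 0


def self_rewarding_iteration(
    generate: Callable[[str, int], List[str]],
    dpo_update: Callable[[List[Dict[str, str]]], None],
    prompts: List[str],
    n_candidates: int = 4,
) -> None:
    """One iteration: sample candidates, self-judge, build pairs, run DPO."""
    pairs = []
    for prompt in prompts:
        candidates = generate(prompt, n_candidates)
        scored = [(judge_score(generate, prompt, c), c) for c in candidates]
        scored.sort(key=lambda sc: sc[0])
        (lo, worst), (hi, best) = scored[0], scored[-1]
        # Tied scores are dropped: DPO needs a clear chosen/rejected signal.
        if hi > lo:
            pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    dpo_update(pairs)  # M_{t+1} is trained on pairs labeled by M_t itself
```

Because the judge is the policy model itself, each iteration can, in principle, improve both the responses and the reward signal used to train the next iteration, which is the "both axes" claim in the abstract.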