Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. While it enables LLMs to achieve human-level alignment, it often incurs significant computational and financial costs because it relies on training external reward models or collecting human-labeled preferences. In this work, we propose Implicit Preference Optimization (IPO), an alternative approach that leverages generative LLMs as preference classifiers, thereby reducing the dependence on external human feedback or reward models for obtaining preferences. We conduct a comprehensive evaluation of the preference-classification ability of LLMs on RewardBench, assessing models across different sizes, architectures, and training levels to validate our hypothesis. Furthermore, we investigate the self-improvement capabilities of LLMs by generating multiple responses for a given instruction and employing the model itself as a preference classifier for Direct Preference Optimization (DPO)-based training. Our findings demonstrate that models trained through IPO achieve performance comparable to that of models trained with preferences obtained from state-of-the-art reward models.
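As an illustrative sketch only (not code from the paper), the snippet below shows how a generative model could double as its own preference classifier to turn sampled responses into (chosen, rejected) pairs for DPO-style training. The model name, judging prompt, and pairwise ranking heuristic are all assumptions introduced here for clarity.

```python
# Minimal sketch of an IPO-style self-labeling loop, assuming a Hugging Face
# text-generation pipeline. The model ("gpt2"), the judging prompt, and the
# ranking heuristic are illustrative placeholders, not the paper's setup.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder policy model

def generate_candidates(instruction, n=4):
    """Sample multiple candidate responses for one instruction."""
    outputs = generator(instruction, num_return_sequences=n, do_sample=True,
                        max_new_tokens=64, return_full_text=False)
    return [o["generated_text"] for o in outputs]

def judge(instruction, response_a, response_b):
    """Use the same LLM as a preference classifier over a pair of responses."""
    prompt = (f"Instruction: {instruction}\n"
              f"Response A: {response_a}\n"
              f"Response B: {response_b}\n"
              "Which response is better? Answer with A or B: ")
    out = generator(prompt, max_new_tokens=1, do_sample=False,
                    return_full_text=False)[0]["generated_text"]
    return "A" if "A" in out else "B"

def build_dpo_pair(instruction):
    """Turn self-judged preferences into a (chosen, rejected) pair for DPO."""
    candidates = generate_candidates(instruction)
    best = candidates[0]
    for cand in candidates[1:]:
        if judge(instruction, best, cand) == "B":
            best = cand
    others = [c for c in candidates if c != best]
    rejected = others[0] if others else best
    return {"prompt": instruction, "chosen": best, "rejected": rejected}

if __name__ == "__main__":
    print(build_dpo_pair("Explain why the sky is blue in one sentence."))
```

In practice, the resulting {"prompt", "chosen", "rejected"} records could be fed directly into a standard DPO trainer; the point of the sketch is only that no external reward model or human annotation enters the loop.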