Reinforcement Learning from Human Feedback (RLHF) is widely used to align the behavior of Large Language Models (LLMs) with human preferences. A popular recent alternative is Direct Preference Optimization (DPO), which replaces the LLM-based reward model with the policy itself, obviating the extra memory and training time needed to learn a reward model. However, DPO does not account for the relative quality of the positive and negative responses, which can lead to sub-optimal training outcomes. To alleviate this problem, we investigate using the intrinsic knowledge of the LLM being fine-tuned to estimate these relative qualities and refine the loss function. Specifically, we leverage the LLM's knowledge to design a refinement function that estimates the quality of both the positive and negative responses. We show that, under mild assumptions, the constructed refinement function can help self-refine the loss function. We integrate the refinement function into DPO and its variant Identity Preference Optimization (IPO). Experiments with various evaluators indicate that the refined losses improve the performance of the fine-tuned models over both DPO and IPO.
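To make the idea concrete, the sketch below shows the standard sigmoid-form DPO objective on a single preference pair, extended with a quality-dependent margin. The refinement function here (`quality_gap`, with its scale `alpha` and the margin name `delta`) is a hypothetical illustration of weighting the loss by a model-derived quality estimate, not the paper's exact formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg,
             beta=0.1, delta=0.0):
    """DPO loss on one (positive, negative) pair, with an optional
    refinement margin `delta` (hypothetical extension).

    Standard DPO is recovered at delta = 0:
        L = -log sigma(beta * [(logp_pos - ref_logp_pos)
                               - (logp_neg - ref_logp_neg)])
    A positive `delta` demands a larger implicit-reward gap when the
    positive response is judged much better than the negative one.
    """
    reward_gap = beta * ((logp_pos - ref_logp_pos)
                         - (logp_neg - ref_logp_neg))
    return -math.log(sigmoid(reward_gap - delta))

def quality_gap(policy_logp_pos, policy_logp_neg, alpha=0.5):
    """Toy relative-quality estimate from the policy's own log-probs
    (a stand-in for the paper's LLM-knowledge-based refinement)."""
    return alpha * (policy_logp_pos - policy_logp_neg)
```

With `delta = 0` the function reduces exactly to DPO; feeding `quality_gap(...)` in as `delta` makes the loss stricter on pairs the model itself already separates clearly, which is the intuition behind quality-aware refinement.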