As large language models (LLMs) become more capable, fine-tuning techniques for aligning them with human intent are increasingly important. A key consideration for aligning these models is how to most effectively use human resources, or model resources in the case where LLMs themselves are used as oracles. Reinforcement learning from human or AI preferences (RLHF/RLAIF) is the most prominent example of such a technique, but it is complex and often unstable. Direct Preference Optimization (DPO) has recently been proposed as a simpler and more stable alternative. In this work, we develop an active learning strategy for DPO to make better use of preference labels. We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model and a measure of certainty of the implicit preference model optimized by DPO. We demonstrate how our approach improves both the rate of learning and the final performance of fine-tuning on pairwise preference data.
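To make the two ingredients of the acquisition function concrete, the sketch below shows how they could be estimated from log-probabilities: a Monte-Carlo estimate of the language model's predictive entropy over sampled completions, and the certainty of DPO's implicit preference model as the absolute margin between implicit rewards. The helper names, the entropy estimator, and the way the two scores are combined are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def predictive_entropy(completion_logprobs: torch.Tensor) -> torch.Tensor:
    """Monte-Carlo estimate of the predictive entropy H(y | x) for one prompt.

    completion_logprobs: shape (num_samples,), total log-probability
    log pi(y | x) of each sampled completion under the current policy.
    """
    # H(y | x) ~= -E_y[log pi(y | x)], averaged over sampled completions.
    return -completion_logprobs.mean()

def implicit_preference_certainty(
    policy_logp_a: torch.Tensor,  # log pi(y_a | x) under the policy
    policy_logp_b: torch.Tensor,  # log pi(y_b | x) under the policy
    ref_logp_a: torch.Tensor,     # log pi_ref(y_a | x) under the reference model
    ref_logp_b: torch.Tensor,     # log pi_ref(y_b | x) under the reference model
    beta: float = 0.1,
) -> torch.Tensor:
    """Certainty of the implicit preference model for a completion pair.

    DPO's implicit reward is r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x));
    the absolute reward margin measures how strongly the current policy
    already prefers one completion over the other.
    """
    margin = beta * ((policy_logp_a - ref_logp_a) - (policy_logp_b - ref_logp_b))
    return margin.abs()

def acquisition_score(entropy: torch.Tensor,
                      certainty: torch.Tensor,
                      weight: float = 1.0) -> torch.Tensor:
    """Illustrative combination of the two signals into a single score.

    The sign and weighting here are an assumption for the sketch; how the
    entropy and certainty terms are actually traded off is a design choice
    of the acquisition function.
    """
    return entropy + weight * certainty
```

For example, with `completion_logprobs = torch.tensor([-42.0, -55.3, -48.1])` and per-pair log-probabilities from the policy and reference models, the pool of candidate prompt/completion pairs would be ranked by `acquisition_score` and the top-scoring pairs sent for preference labeling before the next round of DPO updates.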