As large language models (LLMs) become more capable, fine-tuning techniques for aligning them with human intent are increasingly important. A key consideration in aligning these models is how to make the most effective use of human resources, or of model resources when LLMs themselves serve as oracles. Reinforcement learning from human or AI preferences (RLHF/RLAIF) is the most prominent example of such a technique, but it is complex and often unstable. Direct Preference Optimization (DPO) has recently been proposed as a simpler and more stable alternative. In this work, we develop an active learning strategy for DPO to make better use of preference labels. We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model and a measure of certainty of the implicit preference model optimized by DPO. We demonstrate that our approach improves both the rate of learning and the final performance of fine-tuning on pairwise preference data.
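To make the two ingredients of the acquisition function concrete, the following is a minimal sketch and not the exact procedure of this work. It assumes that predictive entropy is estimated by the negative mean log-probability of a few sampled completions, that the DPO implicit reward takes the standard form β(log π_θ(y|x) − log π_ref(y|x)), and that certainty is the absolute gap between the implicit rewards of the two completions. The names `Candidate`, `select_batch`, and `shortlist_factor`, and the choice to shortlist by entropy before ranking by certainty, are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """One prompt with two candidate completions (hypothetical container)."""
    prompt: str
    policy_logps: Tuple[float, float]  # log pi_theta(y_i | x) for i = 1, 2
    ref_logps: Tuple[float, float]     # log pi_ref(y_i | x) for i = 1, 2
    sample_logps: List[float]          # log-probs of completions sampled from pi_theta


def predictive_entropy(sample_logps: List[float]) -> float:
    """Monte Carlo entropy estimate: negative mean log-prob of the samples."""
    return -sum(sample_logps) / len(sample_logps)


def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """Standard DPO implicit reward (up to a prompt-dependent constant)."""
    return beta * (policy_logp - ref_logp)


def preference_certainty(c: Candidate, beta: float = 0.1) -> float:
    """Assumed certainty measure: absolute gap between the two implicit rewards."""
    r1 = implicit_reward(c.policy_logps[0], c.ref_logps[0], beta)
    r2 = implicit_reward(c.policy_logps[1], c.ref_logps[1], beta)
    return abs(r1 - r2)


def select_batch(pool: List[Candidate], k: int,
                 shortlist_factor: int = 2, beta: float = 0.1) -> List[Candidate]:
    """Shortlist the highest-entropy prompts, then keep the k pairs with the
    largest implicit-reward gap. The ordering direction is an assumption."""
    shortlist = sorted(pool, key=lambda c: predictive_entropy(c.sample_logps),
                       reverse=True)[: k * shortlist_factor]
    return sorted(shortlist, key=lambda c: preference_certainty(c, beta),
                  reverse=True)[:k]
```

In an active learning loop of this kind, the selected pairs would be sent to the human or AI oracle for labeling and then used for the next round of DPO updates, after which the scores are recomputed with the updated model.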