Reinforcement learning from human feedback (RLHF) has become a cornerstone of the training and alignment pipeline for large language models (LLMs). Recent advances, such as direct preference optimization (DPO), have simplified the preference learning step. However, collecting preference data remains a challenging and costly process, often requiring expert annotation. This cost can be mitigated by carefully selecting which data points are presented for annotation. In this work, we propose an active learning approach that efficiently selects prompt and preference pairs using a risk assessment strategy based on the Sharpe ratio. To address the challenge that preferences are unknown prior to annotation, our method evaluates the gradients of all potential preference annotations to assess their impact on model updates. These gradient-based evaluations enable risk assessment of data points regardless of the annotation outcome. Leveraging the structure of the DPO loss, we derive a closed-form expression for computing these Sharpe ratios on a per-tuple basis, ensuring that our approach remains both tractable and computationally efficient. We also introduce two variants of our method, each making different assumptions about prior information. Experimental results demonstrate that, with limited human preference data, our method outperforms the baseline by up to 5% in win rate against the chosen completion across several language models and real-world datasets.
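To make the selection criterion concrete, the following is a minimal illustrative sketch of a per-tuple Sharpe ratio; it uses the textbook mean-over-standard-deviation form with an assumed outcome probability $p$, and is not the paper's exact closed-form derivation from the DPO loss. For an unannotated tuple $(x, y_1, y_2)$, let $g_1$ and $g_2$ denote the DPO gradients the model would receive if the annotator preferred $y_1$ or $y_2$, respectively, and let $p$ be the assumed probability of the first outcome (for instance, the current policy's implicit preference probability). The mean and standard deviation of the resulting update magnitude are
\[
\mu(x) = p\,\lVert g_1\rVert + (1-p)\,\lVert g_2\rVert,
\qquad
\sigma(x) = \sqrt{\,p\bigl(\lVert g_1\rVert - \mu(x)\bigr)^2 + (1-p)\bigl(\lVert g_2\rVert - \mu(x)\bigr)^2\,},
\]
and the tuple's score is the ratio
\[
\mathrm{SR}(x) = \frac{\mu(x)}{\sigma(x)},
\]
so candidate tuples can be ranked for annotation by the expected size of their model update relative to the risk induced by the unknown preference label.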