Direct Preference Optimization (DPO) has emerged as a compelling approach for training Large Language Models (LLMs) to adhere to human preferences. However, the performance of DPO is sensitive to the fine-tuning of its trade-off parameter $\beta$, as well as to the quality of the preference data. We analyze the impact of $\beta$ and data quality on DPO, uncovering that optimal $\beta$ values vary with the informativeness of pairwise data. Addressing the limitations of static $\beta$ values, we introduce a novel framework that dynamically calibrates $\beta$ at the batch level, informed by data quality considerations. Additionally, our method incorporates $\beta$-guided data filtering to safeguard against the influence of outliers. Through empirical evaluation, we demonstrate that our dynamic $\beta$ adjustment technique significantly improves DPO's performance across a range of models and datasets, offering a more robust and adaptable training paradigm for aligning LLMs with human feedback. The code is available at \url{https://github.com/junkangwu/beta-DPO}.
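The abstract describes two mechanisms: calibrating $\beta$ per batch from data quality, and $\beta$-guided filtering of outlier pairs. The following is a minimal, illustrative sketch of how such a scheme could look; the linear scaling rule in `batch_beta`, the hyperparameters (`beta0`, `alpha`, `m0`, `k`), and the helper names are assumptions for exposition, not the paper's definitive implementation.

```python
import math

def dpo_loss(logp_w_pi, logp_l_pi, logp_w_ref, logp_l_ref, beta):
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)])."""
    margin = (logp_w_pi - logp_w_ref) - (logp_l_pi - logp_l_ref)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def batch_beta(margins, beta0=0.1, alpha=0.6, m0=0.0):
    """Hypothetical batch-level calibration: scale a base beta0 by how far
    the batch's mean implicit-reward margin deviates from a reference m0.
    More informative batches (larger margins) get a larger beta."""
    m_batch = sum(margins) / len(margins)
    return beta0 * (1.0 + alpha * (m_batch - m0))

def filter_outliers(margins, k=2.0):
    """Hypothetical margin-based filtering: keep indices of pairs whose
    implicit-reward margin lies within k std devs of the batch mean."""
    n = len(margins)
    mean = sum(margins) / n
    std = math.sqrt(sum((m - mean) ** 2 for m in margins) / n)
    return [i for i, m in enumerate(margins) if abs(m - mean) <= k * std]
```

In this sketch, one would first compute per-pair implicit-reward margins under the current policy and reference model, filter outlier pairs, then evaluate the DPO loss on the survivors with the batch-calibrated $\beta$.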