Direct preference optimization (DPO), a widely adopted offline preference optimization algorithm, aims to align large language models (LLMs) with human-desired behaviors using pairwise preference data. However, the winning response and the losing response within pairwise data are generated in isolation, leading to weak correlations between them and, in turn, suboptimal alignment performance. To address this issue, we propose BMC, an effective framework for bridging and modeling correlations in pairwise data. First, we increase the consistency and informativeness of the pairwise preference signals through targeted modifications, synthesizing a pseudo-winning response by improving the losing response based on the winning response. Second, we identify that DPO alone is insufficient to model these correlations and capture nuanced variations. We therefore propose learning token-level correlations by dynamically leveraging the policy model's confidence during training. Comprehensive experiments on QA, math, and instruction-following tasks demonstrate the effectiveness of our approach, which significantly surpasses competitive baselines, including DPO. Additionally, our in-depth quantitative analysis reveals the reasons behind our method's superior performance over DPO and showcases its versatility with respect to other DPO variants.
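To make the second component more concrete, the sketch below illustrates one way a token-level, confidence-weighted DPO-style objective could be written in PyTorch. It is a minimal sketch under our own assumptions: the function names (`token_logps`, `bmc_style_dpo_loss`) and the specific weighting scheme (detached per-token policy probabilities used to rescale the policy/reference log-ratios) are hypothetical illustrations, not the exact formulation of BMC.

```python
import torch
import torch.nn.functional as F


def token_logps(logits, labels, mask):
    """Per-token log-probabilities of the given labels under a model."""
    logps = torch.log_softmax(logits, dim=-1)
    picked = torch.gather(logps, dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)
    return picked * mask  # zero out prompt / padding positions


def bmc_style_dpo_loss(policy_chosen_logits, policy_rejected_logits,
                       ref_chosen_logits, ref_rejected_logits,
                       chosen_labels, rejected_labels,
                       chosen_mask, rejected_mask, beta=0.1):
    # Per-token log-probs under the policy and the frozen reference model.
    pi_c = token_logps(policy_chosen_logits, chosen_labels, chosen_mask)
    pi_r = token_logps(policy_rejected_logits, rejected_labels, rejected_mask)
    ref_c = token_logps(ref_chosen_logits, chosen_labels, chosen_mask)
    ref_r = token_logps(ref_rejected_logits, rejected_labels, rejected_mask)

    # Hypothetical token-level weights derived from the policy's confidence
    # (probability assigned to each target token), detached so they act as a
    # dynamic emphasis rather than an extra gradient path.
    w_c = pi_c.exp().detach() * chosen_mask
    w_r = pi_r.exp().detach() * rejected_mask

    # Confidence-weighted policy/reference log-ratios per response.
    chosen_logratio = ((pi_c - ref_c) * w_c).sum(-1)
    rejected_logratio = ((pi_r - ref_r) * w_r).sum(-1)

    # Standard DPO-style Bradley-Terry objective on the weighted margins.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

In this sketch, setting all weights to 1 recovers a vanilla sequence-level DPO loss, which is one way to see the token-level weighting as a strict generalization of DPO.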