Aligning large language models (LLMs) with human intent is critical for enhancing their performance across a variety of tasks. Standard alignment techniques, such as Direct Preference Optimization (DPO), often rely on the binary Bradley-Terry (BT) model, which can struggle to capture the complexities of human preferences -- particularly in the presence of noisy or inconsistent labels and frequent ties. To address these limitations, we introduce the Tie-rank Oriented Bradley-Terry model (TOBT), an extension of the BT model that explicitly incorporates ties, enabling more nuanced preference representation. Building on this, we propose Tie-rank Oriented Direct Preference Optimization (TODO), a novel alignment algorithm that leverages TOBT's ternary ranking system to improve preference alignment. In evaluations on Mistral-7B and Llama 3-8B models, TODO consistently outperforms DPO in modeling preferences across both in-distribution and out-of-distribution datasets. Additional assessments on MT-Bench and benchmarks such as PIQA, ARC-c, and MMLU further demonstrate TODO's superior alignment performance. Notably, TODO also shows strong results in binary preference alignment, highlighting its versatility and potential for broader integration into LLM alignment. The implementation is available at https://github.com/XXares/TODO.
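For readers unfamiliar with tie-aware extensions of the Bradley-Terry model, a classical illustration is Davidson's (1970) formulation, sketched below; note this is only an illustrative example of how ties can be modeled, and the TOBT model proposed in this paper is defined in the main text and may differ in its exact parameterization.

\[
P(y_i \succ y_j) = \frac{\pi_i}{\pi_i + \pi_j + \nu\sqrt{\pi_i \pi_j}}, \qquad
P(y_i \sim y_j) = \frac{\nu\sqrt{\pi_i \pi_j}}{\pi_i + \pi_j + \nu\sqrt{\pi_i \pi_j}},
\]

where \(\pi_i, \pi_j > 0\) are latent preference strengths and \(\nu \ge 0\) controls the probability mass assigned to ties; setting \(\nu = 0\) recovers the standard binary Bradley-Terry model.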