In complex reasoning tasks such as mathematical reasoning, recent work has proposed using Direct Preference Optimization (DPO) to suppress the output of dispreferred responses, thereby enhancing the long-chain reasoning capabilities of large language models (LLMs). To this end, these studies employ LLMs to generate preference trees via Tree-of-Thoughts (ToT) and sample the paired preference responses required by the DPO algorithm. However, DPO, which is based on binary preference optimization, cannot learn from the multiple responses with varying degrees of preference or dispreference provided by the preference trees, resulting in incomplete preference learning. In this work, we introduce Tree Preference Optimization (TPO), which does not sample paired preference responses from the preference tree; instead, it learns directly from the entire preference tree during fine-tuning. Specifically, TPO formulates language model alignment as a Preference List Ranking problem, in which the policy can potentially learn more effectively from a ranked preference list of responses for a given prompt. In addition, to help LLMs identify discriminative steps within long-chain reasoning and to increase the relative reward margin in the preference list, TPO uses an Adaptive Step Reward to adjust the reward value of each step in a trajectory, enabling fine-grained preference optimization. We conduct extensive experiments on mathematical reasoning tasks to evaluate TPO. The results show that TPO consistently outperforms DPO across five public large language models on four datasets. Our code is publicly available at https://github.com/MrBlankness/TPO.git.
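To make the contrast with pairwise DPO concrete, the following is a minimal sketch of one way a list-wise preference ranking objective over a ranked list of responses could look, assuming DPO-style implicit rewards and a Plackett-Luce likelihood over the ranking. The function name `listwise_preference_loss`, the `beta` value, and the toy log-probabilities are illustrative assumptions, not the paper's exact TPO objective, and the sketch omits the Adaptive Step Reward described above.

```python
import torch

def listwise_preference_loss(policy_logps, ref_logps, beta=0.1):
    """Plackett-Luce style list-wise preference loss (illustrative sketch).

    policy_logps, ref_logps: (K,) tensors of sequence log-probabilities for
    K responses to the same prompt, ordered from most to least preferred.
    Rewards are DPO-style implicit rewards:
        r_k = beta * (log pi_theta(y_k | x) - log pi_ref(y_k | x)).
    The loss is the negative log Plackett-Luce likelihood of the ranking.
    """
    rewards = beta * (policy_logps - ref_logps)  # (K,)
    K = rewards.shape[0]
    loss = 0.0
    for k in range(K):
        # Probability that response k is ranked above all lower-ranked ones.
        suffix = rewards[k:]
        loss = loss - (rewards[k] - torch.logsumexp(suffix, dim=0))
    return loss / K

# Toy usage: three responses ranked best -> worst (hypothetical values).
policy_logps = torch.tensor([-12.3, -15.1, -18.7], requires_grad=True)
ref_logps = torch.tensor([-13.0, -14.8, -18.2])
print(listwise_preference_loss(policy_logps, ref_logps))
```

Unlike the pairwise DPO loss, which consumes a single chosen/rejected pair, this kind of objective uses the full ordering of the K responses, which is the property the abstract attributes to learning from the entire preference tree.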