Iterative preference optimization methods have recently been shown to perform well on general instruction tuning tasks, but they typically yield little improvement on reasoning tasks (Yuan et al., 2024; Chen et al., 2024). In this work we develop an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates by optimizing for winning vs. losing reasoning steps that lead to the correct answer. We train using a modified DPO loss (Rafailov et al., 2023) with an additional negative log-likelihood term, which we find to be crucial. We show that reasoning improves across repeated iterations of this scheme. While relying only on examples in the training set, our approach increases the accuracy of Llama-2-70B-Chat from 55.6% to 81.6% on GSM8K (and to 88.7% with majority voting over 32 samples), from 12.5% to 20.8% on MATH, and from 77.8% to 86.7% on ARC-Challenge, outperforming other Llama-2-based models that do not rely on additionally sourced datasets.
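The modified loss described in the abstract, a DPO term plus a negative log-likelihood (NLL) term on the winning sequence, can be sketched for a single preference pair as follows. This is a minimal illustration rather than the authors' implementation: the abstract does not give the exact form, so the weight `alpha`, the temperature `beta`, the length normalization of the NLL term, and all function and argument names are assumptions made here.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_nll_loss(
    logp_w: float,      # policy log-prob of the winning (correct) CoT + answer
    logp_l: float,      # policy log-prob of the losing (incorrect) CoT + answer
    ref_logp_w: float,  # reference-model log-prob of the winning sequence
    ref_logp_l: float,  # reference-model log-prob of the losing sequence
    len_w: int,         # token length of the winning sequence
    beta: float = 0.1,  # assumed DPO temperature
    alpha: float = 1.0, # assumed weight on the NLL term
) -> float:
    """DPO loss plus an NLL term on the winner (illustrative sketch)."""
    # Standard DPO margin: difference of policy/reference log-ratios.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo_term = -log_sigmoid(margin)
    # Assumption: the NLL term is normalized by the winner's length.
    nll_term = -logp_w / len_w
    return dpo_term + alpha * nll_term


# When the policy equals the reference, the DPO margin is zero, so the
# DPO term is log(2) and only the NLL term depends on the sequence.
loss = dpo_nll_loss(-10.0, -12.0, -10.0, -12.0, len_w=5)
```

Setting `alpha=0.0` recovers the plain DPO loss for the pair, which makes it easy to compare training with and without the additional term.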