Iterative preference optimization methods have recently been shown to perform well for general instruction tuning tasks, but typically yield little improvement on reasoning tasks (Yuan et al., 2024; Chen et al., 2024). In this work we develop an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates by optimizing for winning vs. losing reasoning steps that lead to the correct answer. We train using a modified DPO loss (Rafailov et al., 2023) with an additional negative log-likelihood term, which we find to be crucial. We show that reasoning improves across repeated iterations of this scheme. While relying only on examples in the training set, our approach yields increasing accuracy on GSM8K, MATH, and ARC-Challenge for Llama-2-70B-Chat, outperforming other Llama-2-based models that do not rely on additionally sourced datasets. For example, we see a large improvement from 55.6% to 81.6% on GSM8K, and an accuracy of 88.7% with majority voting out of 32 samples.
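To make the training objective concrete, the following is a minimal sketch of a DPO loss combined with a negative log-likelihood term on the winning sequence, for a single preference pair. The abstract does not specify how the two terms are combined, so the weight `alpha`, the length normalization of the NLL term, and all function and parameter names here are illustrative assumptions rather than the authors' implementation. Inputs are summed token log-probabilities of the winning and losing CoT-plus-answer sequences under the current policy and under a frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_nll_loss(
    policy_logp_win: float,
    policy_logp_lose: float,
    ref_logp_win: float,
    ref_logp_lose: float,
    win_len: int,
    beta: float = 0.1,   # assumed DPO temperature
    alpha: float = 1.0,  # assumed weight on the NLL term
) -> float:
    """DPO loss plus an NLL term on the winning sequence (hypothetical form)."""
    # Standard DPO margin: how much more the policy prefers the winner
    # over the loser, relative to the reference model.
    margin = beta * (
        (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    )
    dpo_term = -log_sigmoid(margin)
    # Assumption: NLL of the winner is averaged over its token count.
    nll_term = -policy_logp_win / win_len
    return dpo_term + alpha * nll_term


# Example pair: the policy starts out identical to the reference model.
loss = dpo_nll_loss(
    policy_logp_win=-12.0,
    policy_logp_lose=-15.0,
    ref_logp_win=-12.0,
    ref_logp_lose=-15.0,
    win_len=6,
)
print(f"loss = {loss:.4f}")
```

With `alpha=0` the function reduces to the standard DPO loss. With `alpha>0`, the extra term directly rewards raising the likelihood of the winning sequence, which the DPO margin alone does not guarantee.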