Preference optimization techniques, such as Direct Preference Optimization (DPO), are frequently employed to enhance the reasoning capabilities of large language models (LLMs) in domains like mathematical reasoning and coding, typically following supervised fine-tuning. These methods rely on high-quality labels for reasoning tasks to generate preference pairs; however, the availability of reasoning datasets with human-verified labels is limited. In this study, we introduce a novel approach to generating pseudo feedback for reasoning tasks by framing the labeling of solutions to reasoning problems as an evaluation against associated test cases. We explore two forms of test-case-based pseudo feedback: one generated by frontier LLMs and the other obtained by extending self-consistency to multiple test cases. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization and observe improvements on both. Specifically, using Mathstral-7B as the base model, we improve the MATH score from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. On GSM8K and College Math, our scores increase from 85.6 to 90.3 and from 34.3 to 42.3, respectively. Building on Deepseek-coder-7B-v1.5, we achieve a score of 24.6 on LiveCodeBench (up from 21.1), surpassing Claude-3-Haiku.
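As an illustration of how test-case-based pseudo feedback can drive preference optimization, the sketch below (not the authors' released code) scores candidate solutions by the fraction of test cases they pass, derives pseudo-expected outputs by majority vote across candidates (one way to extend self-consistency to multiple test cases), and pairs high- and low-scoring candidates as (chosen, rejected) examples for DPO. The helper `run_candidate`, the `margin` threshold, and the pairing rule are illustrative assumptions; real use would sandbox execution and handle timeouts.

```python
# Minimal sketch of building DPO preference pairs from test-case-based
# pseudo feedback. `run_candidate(candidate, test_input)` is a hypothetical
# helper that executes one candidate solution on one test input and
# returns its output as a string.

from collections import Counter
from typing import Callable, List, Tuple

Runner = Callable[[str, str], str]


def self_consistent_expected(candidates: List[str],
                             test_inputs: List[str],
                             run_candidate: Runner) -> List[str]:
    """Pseudo-label each test input with the majority output across
    candidates (self-consistency extended to multiple test cases)."""
    expected = []
    for test_input in test_inputs:
        outputs = [run_candidate(c, test_input) for c in candidates]
        expected.append(Counter(outputs).most_common(1)[0][0])
    return expected


def pseudo_score(candidate: str,
                 test_cases: List[Tuple[str, str]],
                 run_candidate: Runner) -> float:
    """Fraction of test cases the candidate passes (its pseudo reward)."""
    passed = sum(run_candidate(candidate, test_input) == expected
                 for test_input, expected in test_cases)
    return passed / len(test_cases)


def build_preference_pairs(candidates: List[str],
                           test_cases: List[Tuple[str, str]],
                           run_candidate: Runner,
                           margin: float = 0.5) -> List[Tuple[str, str]]:
    """Pair each high-scoring candidate (chosen) with the lowest-scoring
    candidate at least `margin` below it (rejected)."""
    scored = sorted(((pseudo_score(c, test_cases, run_candidate), c)
                     for c in candidates), reverse=True)
    pairs = []
    for hi_score, chosen in scored:
        for lo_score, rejected in reversed(scored):  # ascending by score
            if hi_score - lo_score >= margin:
                pairs.append((chosen, rejected))  # (chosen, rejected) for DPO
                break
    return pairs
```

The resulting (chosen, rejected) pairs can be fed to any standard DPO trainer; the margin-based pairing is one simple design choice for ensuring a clear quality gap within each pair.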