Effective training of language models (LMs) for mathematical reasoning tasks demands high-quality supervised fine-tuning data. Besides obtaining annotations from human experts, a common alternative is sampling from larger, more powerful LMs. However, this knowledge distillation approach can be costly and unstable, particularly when relying on closed-source, proprietary LMs such as GPT-4, whose behaviors are often unpredictable. In this work, we demonstrate that the reasoning abilities of small-scale LMs can be enhanced through self-training, a process in which models learn from their own outputs. We also show that conventional self-training can be further augmented by a preference learning algorithm called Direct Preference Optimization (DPO). By integrating DPO into self-training, we leverage preference data to guide LMs toward more accurate and diverse chain-of-thought reasoning. We evaluate our method across various mathematical reasoning tasks using different base models. Our experiments show that this approach not only improves LMs' reasoning performance but also offers a more cost-effective and scalable solution compared to relying on large proprietary LMs.
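For concreteness, the standard DPO objective referenced above fine-tunes the policy $\pi_\theta$ against a frozen reference policy $\pi_{\mathrm{ref}}$ on preference pairs $(x, y_w, y_l)$; in this self-training setting, a plausible instantiation (an assumption, not a detail stated in the abstract) is that $y_w$ and $y_l$ are a correct and an incorrect self-generated chain-of-thought solution to the same problem $x$:
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]
where $\sigma$ is the logistic function and $\beta$ controls how far the policy may drift from the reference model.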