Small language models (SLMs) are more efficient, cost-effective, and customizable than large language models (LLMs), but they often underperform on specific tasks such as reasoning. Previous approaches to improving SLM reasoning, such as supervised fine-tuning and knowledge distillation, typically depend on costly external supervision signals; under such limited supervision, SLMs tend to become overconfident, which caps their reasoning ability. This study therefore enables SLMs to learn to reason from self-iterative feedback. Building on odds ratio preference optimization (ORPO), we fine-tune and align SLMs using positive and negative signals that the models generate themselves. In addition, we introduce process-level supervision into the preference-alignment rewards through sampling-based inference simulation and process reward models. Compared with supervised fine-tuning (SFT), our method improves Gemma-2B by 12.43 points (accuracy) on GSM8K and 3.95 points (Pass@1) on MBPP. The proposed method also demonstrates superior out-of-domain generalization on MMLU_Math and HumanEval.
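For reference, a minimal sketch of the ORPO objective (following its standard formulation in Hong et al., 2024) is given below; interpreting the abstract, $y_w$ and $y_l$ would here denote the self-generated positive and negative responses to a prompt $x$, $\mathcal{L}_{\mathrm{SFT}}$ is the negative log-likelihood loss on the positive response, and $\lambda$ is a weighting hyperparameter. This is a reference sketch of the general technique, not the paper's exact training recipe.

\[
\mathcal{L}_{\mathrm{ORPO}} = \mathbb{E}_{(x,\, y_w,\, y_l)}\big[\,\mathcal{L}_{\mathrm{SFT}} + \lambda \cdot \mathcal{L}_{\mathrm{OR}}\,\big],
\qquad
\mathcal{L}_{\mathrm{OR}} = -\log \sigma\!\left(\log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\right),
\qquad
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}.
\]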