The alignment of reasoning abilities between smaller and larger language models is largely conducted via Supervised Fine-Tuning (SFT) on demonstrations generated by powerful Large Language Models (LLMs). Although these approaches deliver more performant models, they generalize poorly because training relies solely on the provided demonstrations. In this paper, we propose Self-refine Instruction-tuning, a method that elicits Smaller Language Models to refine their own abilities. Our approach is based on a two-stage process: reasoning abilities are first transferred from LLMs to Small Language Models (SLMs) via Instruction-tuning on demonstrations provided by the LLMs, and the instructed models then self-refine their abilities through preference optimization. In particular, the second stage applies refinement heuristics based on the Direct Preference Optimization algorithm: the SLMs produce a series of reasoning paths by sampling their own generated responses, which are rewarded against ground truths provided by the LLMs. Results on commonsense and math reasoning tasks show that this approach significantly outperforms Instruction-tuning in both in-domain and out-of-domain scenarios, aligning the reasoning abilities of smaller and larger language models.
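To make the second stage concrete, below is a minimal sketch of the DPO objective and of how preference pairs could be built from self-sampled SLM responses scored against LLM-provided ground truths. The helper names (`dpo_loss`, `build_preference_pairs`, `extract_answer`) and the `beta` value are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is a tensor of per-sequence log-probabilities
    log pi(y|x) summed over response tokens; `beta` scales the
    implicit reward (0.1 is a common default, not necessarily
    the value used in the paper).
    """
    chosen_reward = policy_chosen_logps - ref_chosen_logps
    rejected_reward = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between preferred and dispreferred reasoning paths.
    return -F.logsigmoid(beta * (chosen_reward - rejected_reward)).mean()

def build_preference_pairs(question, sampled_answers, gold_answer, extract_answer):
    """Pair self-sampled SLM reasoning paths using the LLM ground truth as reward.

    `extract_answer` is a hypothetical helper that parses the final answer
    out of a reasoning path; paths matching the gold answer become `chosen`,
    the rest `rejected`.
    """
    chosen = [a for a in sampled_answers if extract_answer(a) == gold_answer]
    rejected = [a for a in sampled_answers if extract_answer(a) != gold_answer]
    return [{"prompt": question, "chosen": c, "rejected": r}
            for c in chosen for r in rejected]
```

Under this sketch, the instruction-tuned SLM from stage one serves as both the initial policy and the frozen reference model, and the pairs produced by `build_preference_pairs` drive the DPO updates of the self-refinement stage.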