Training on large amounts of rationales (i.e., CoT Fine-tuning) is effective at improving the reasoning capabilities of large language models (LLMs). However, acquiring human-authored rationales or augmenting rationales from proprietary models is costly and not scalable. In this paper, we study whether LLMs can self-improve their reasoning capabilities. To this end, we propose Self-Explore, where the LLM is tasked to explore the first wrong step (i.e., the first pit) within the rationale and use such signals as fine-grained rewards for further improvement. On the GSM8K and MATH test sets, Self-Explore achieves average improvements of 11.57% and 2.89%, respectively, across three LLMs compared to supervised fine-tuning (SFT). Our code is available at https://github.com/hbin0701/Self-Explore.
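The core operation in Self-Explore is locating the first wrong step of a sampled rationale via step-level exploration: from each step prefix, continuations are sampled, and the first step from which no continuation reaches the correct answer is taken as the pit. The following is a minimal sketch of that search, assuming a hypothetical model sampler (`sample_completions`) and answer checker (`is_correct`); it illustrates the idea under these assumptions and is not the authors' exact implementation.

```python
# Minimal sketch of the first-pit search (illustrative, not the paper's code).
# `sample_completions` and `is_correct` are hypothetical stand-ins for a
# model sampler and a final-answer checker.
from typing import Callable, List, Optional

def find_first_pit(
    question: str,
    steps: List[str],                                   # steps of one incorrect rationale
    sample_completions: Callable[[str, int], List[str]],
    is_correct: Callable[[str], bool],                  # checks a completion's final answer
    k: int = 4,                                         # rollouts sampled per prefix
) -> Optional[int]:
    """Return the index of the first step from which no sampled
    continuation reaches the correct answer (the "first pit")."""
    prefix = question
    for i, step in enumerate(steps):
        candidate = prefix + "\n" + step
        # If every rollout from this prefix fails, step i is the first pit.
        if not any(is_correct(c) for c in sample_completions(candidate, k)):
            return i
        prefix = candidate
    return None  # no pit found among the sampled rollouts
```

The returned index can then serve as a fine-grained reward signal, e.g., by pairing the rationale truncated before the pit against the continuation that enters it when building preference data.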