Training on large amounts of rationales (i.e., CoT fine-tuning) is effective at improving the reasoning capabilities of large language models (LLMs). However, acquiring human-authored rationales or generating additional rationales with proprietary models is costly and does not scale. In this paper, we study whether LLMs can self-improve their reasoning capabilities. To this end, we propose Self-Explore, in which the LLM is tasked with exploring to find the first wrong step (i.e., the first pit) within a rationale and using such signals as fine-grained rewards for further improvement. On the GSM8K and MATH test sets, Self-Explore achieves average improvements of 11.57% and 2.89%, respectively, across three LLMs compared to supervised fine-tuning (SFT). Our code is available at https://github.com/hbin0701/Self-Explore.
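To make the idea of locating the first pit concrete, the following is a minimal sketch and not the released implementation. It assumes one plausible criterion that the abstract does not spell out: a step counts as the first pit if none of `k` continuations sampled from the prefix ending at that step reaches the gold answer. The names `find_first_pit`, `Explorer`, `stub_explore`, and the parameter `k` are hypothetical and introduced only for illustration; in practice the explorer would be the LLM itself.

```python
from typing import Callable, Optional, Sequence

# Hypothetical interface: given a question and a prefix of rationale steps,
# return the final answer of ONE sampled continuation from that prefix.
Explorer = Callable[[str, Sequence[str]], str]


def find_first_pit(
    question: str,
    steps: Sequence[str],
    gold_answer: str,
    explore: Explorer,
    k: int = 4,
) -> Optional[int]:
    """Return the index of the first step whose prefix never recovers.

    Assumed criterion: step i is the first pit if all k continuations
    sampled from steps[: i + 1] end in a wrong answer. Returns None if
    every prefix can still reach the gold answer.
    """
    for i in range(len(steps)):
        prefix = steps[: i + 1]
        answers = [explore(question, prefix) for _ in range(k)]
        if all(answer != gold_answer for answer in answers):
            return i
    return None


# Deterministic stand-in for the model, used only to exercise the function:
# it "recovers" unless the prefix already contains the faulty value 27.
def stub_explore(question: str, prefix: Sequence[str]) -> str:
    return "27" if any("27" in step for step in prefix) else "18"


demo_steps = [
    "16 - 3 - 4 = 9 eggs left",
    "9 * 3 = 27 dollars",  # the multiplication should use 2, not 3
    "The answer is 27",
]
pit_index = find_first_pit("toy question", demo_steps, "18", stub_explore)
print(pit_index)
```

Under this assumption, the returned index marks where a step-level signal could be attached: steps before the pit are still recoverable, while the pit step itself is the earliest point from which the model cannot reach the correct answer.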