Self-improvement methods enable large language models (LLMs) to generate solutions themselves and iteratively train on filtered, high-quality rationales. This process has proven effective at reducing reliance on human supervision in LLM reasoning, but performance soon plateaus. Examining the process, we find that models tend to over-sample easy queries and under-sample queries they have yet to master. As iterations proceed, this sampling imbalance is exacerbated, yielding a long-tail distribution in which solutions to difficult queries almost vanish. This phenomenon limits the performance gains of self-improving models. A straightforward remedy is brute-force sampling to rebalance the distribution, but it significantly raises computational costs. In this paper, we introduce Guided Self-Improvement (GSI), a strategy aimed at improving the efficiency of sampling challenging, heavy-tailed data. It leverages Socratic-style guidance signals to help LLMs reason about complex queries, reducing exploration effort and minimizing computational overhead. Experiments with four models across diverse mathematical tasks show that GSI strikes a balance between performance and efficiency, while also remaining effective on held-out tasks.