Iterative self-improvement fine-tunes an autoregressive large language model (LLM) on reward-verified outputs generated by the LLM itself. Despite the empirical success of self-improvement, the theoretical understanding of this generative, iterative procedure in a practical, finite-sample setting remains limited. We take a step toward closing this gap by modeling each round of self-improvement as maximum-likelihood fine-tuning on a reward-filtered distribution and deriving finite-sample guarantees for the expected reward. Our analysis reveals an explicit feedback loop in which better models accept more data per iteration, which supports sustained self-improvement and also explains why the improvement eventually saturates. Adopting a task-centric view of reasoning tasks with multiple difficulty levels, we further prove quantifiable conditions on model initialization, task difficulty, and sample budget under which easy-to-hard curricula provably achieve better guarantees than training on fixed mixtures of tasks. Our analyses are validated through Monte-Carlo simulations and experiments spanning a synthetic graph-based reasoning task and multiple standard mathematical reasoning benchmarks.
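To make the feedback loop concrete, the following is a minimal toy sketch and not the paper's actual model or experimental setup. It assumes a hypothetical categorical "policy" over a handful of candidate answers, a binary reward that marks a subset of answers as correct, and a smoothed maximum-likelihood update in which the previous policy acts as `alpha` pseudo-counts. The names `self_improve`, `alpha`, `n_samples`, and `n_rounds` are illustrative assumptions.

```python
import random


def self_improve(p_init, correct, n_samples, n_rounds, alpha=50.0, seed=0):
    """Toy reward-filtered self-improvement on a categorical policy.

    p_init:    initial probabilities over candidate answers (sums to 1)
    correct:   set of answer indices that receive reward 1
    alpha:     pseudo-count weight on the previous policy (an assumption
               of this sketch, standing in for a limited fine-tuning step)
    Returns the final policy and a per-round list of
    (expected reward before the update, number of accepted samples).
    """
    rng = random.Random(seed)
    policy = list(p_init)
    answers = list(range(len(policy)))
    history = []
    for _ in range(n_rounds):
        # Expected reward of the current policy: mass on correct answers.
        reward = sum(policy[a] for a in correct)
        # The model generates its own training candidates.
        samples = rng.choices(answers, weights=policy, k=n_samples)
        # Reward filter: keep only verified-correct outputs.
        accepted = [a for a in samples if a in correct]
        counts = [0] * len(policy)
        for a in accepted:
            counts[a] += 1
        # Smoothed maximum-likelihood fit on the accepted samples.
        total = len(accepted) + alpha
        policy = [(counts[a] + alpha * policy[a]) / total for a in answers]
        history.append((reward, len(accepted)))
    return policy, history


if __name__ == "__main__":
    _, hist = self_improve([0.1, 0.3, 0.3, 0.3], {0}, n_samples=200, n_rounds=10)
    for t, (reward, n_acc) in enumerate(hist):
        print(f"round {t}: expected reward {reward:.3f}, accepted {n_acc}")
```

In this toy, the number of accepted samples scales with the current expected reward, so a better policy supplies itself with more training data in the next round, while the expected reward is capped at 1 and the gains per round shrink as it approaches that cap.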