Recent work in language modeling has raised the possibility of self-improvement, where a language model evaluates and refines its own generations to achieve higher performance without external feedback. Since self-improvement cannot create information that is not already present in the model, why should we expect it to lead to improved capabilities? We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening. Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training in order to ``sharpen'' the model to one that places large mass on high-quality sequences, thereby amortizing the expensive inference-time computation of generating good sequences. We begin by introducing a new statistical framework for sharpening, in which the learner aims to sharpen a pre-trained base policy via sample access, and we establish fundamental limits. We then analyze two natural families of self-improvement algorithms, based on SFT and on RLHF. We find that (i) the SFT-based approach is minimax optimal whenever the initial model has sufficient coverage, but (ii) the RLHF-based approach can improve over SFT-based self-improvement by leveraging online exploration, bypassing the need for coverage. Finally, we empirically validate the sharpening mechanism via inference-time and amortization experiments. We view these findings as a starting point toward a foundational understanding that can guide the design and evaluation of self-improvement algorithms.
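As a concrete illustration of the SFT-based sharpening described above, the following is a minimal sketch, assuming a Hugging Face-style causal LM and taking the model's own sequence log-probability as the self-reward verifier. The model name `gpt2` and the helpers `self_reward`, `best_of_n`, and `sft_sharpen_step` are hypothetical stand-ins for illustration, not the paper's implementation.

```python
# Minimal sketch of SFT-based sharpening (best-of-n self-distillation).
# Assumptions: a Hugging Face causal LM as the base policy, and the
# self-reward r(y|x) = log pi(y|x) (the model's own sequence
# log-probability over response tokens) as the verifier.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # hypothetical stand-in for the base policy
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)


@torch.no_grad()
def self_reward(prompt_ids: torch.Tensor, full_ids: torch.Tensor) -> float:
    """Self-reward: log pi(y|x), summed over response tokens only."""
    logits = model(full_ids.unsqueeze(0)).logits[0]
    log_probs = torch.log_softmax(logits, dim=-1)
    # Token t is predicted from position t-1; score only the response part.
    # (For brevity, any trailing eos padding is scored as well.)
    start = prompt_ids.shape[0]
    token_lp = log_probs[start - 1 : full_ids.shape[0] - 1].gather(
        1, full_ids[start:].unsqueeze(1)
    )
    return token_lp.sum().item()


def best_of_n(prompt: str, n: int = 8, max_new_tokens: int = 64) -> str:
    """Sample n responses from the base policy; keep the self-reward argmax."""
    enc = tokenizer(prompt, return_tensors="pt")
    outs = model.generate(
        **enc,
        do_sample=True,
        num_return_sequences=n,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt_ids = enc["input_ids"][0]
    scores = [self_reward(prompt_ids, seq) for seq in outs]
    best = outs[max(range(n), key=lambda i: scores[i])]
    return tokenizer.decode(best[prompt_ids.shape[0]:], skip_special_tokens=True)


def sft_sharpen_step(prompt: str, optimizer: torch.optim.Optimizer) -> float:
    """One amortization step: fine-tune on the model's own best-of-n response."""
    target = prompt + best_of_n(prompt)
    ids = tokenizer(target, return_tensors="pt")["input_ids"]
    loss = model(ids, labels=ids).loss  # standard SFT cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Usage: iterating over a prompt distribution amortizes the best-of-n search,
# shifting the policy's mass toward sequences the verifier rates highly.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for prompt in ["Q: What is 7 * 8? A:"]:
    sft_sharpen_step(prompt, optimizer)
```

Under this reading, repeated `sft_sharpen_step` updates distill the expensive inference-time best-of-n selection back into the policy itself; the RLHF-based variant would instead optimize the self-reward with online exploration rather than distilling fixed best-of-n samples.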