What should post-training optimize? A test-time scaling law perspective

Large language models are increasingly deployed with test-time strategies: sample $N$ responses, score them with a reward model or verifier, and return the best. This deployment rule exposes a mismatch in post-training: standard objectives optimize the mean reward of a single response, whereas best-of-$N$ performance is governed by the upper tail of the reward distribution. Recent test-time-aware objectives partly address this mismatch, but typically assume that training can use the same per-prompt rollout budget as deployment, which is impractical when post-training must cover many prompts while deployment can allocate much larger per-prompt test-time compute. We study this budget-mismatch regime, where only $m\ll N$ per-prompt rollouts are available during training but the target objective is best-of-$N$ deployment. Under structural assumptions on the reward tails, we show that the policy gradient of the best-of-$N$ objective can be approximated from a much smaller rollout group by extrapolating upper-tail statistics. This yields a family of Tail-Extrapolated estimators for best-of-$N$-oriented post-training: a simple direct estimator, Tail-Extrapolated Advantage (TEA), and a fixed-order debiased Prefix-TEA estimator based on moment cancellation. Experiments on instruction-following tasks show that TEA and Prefix-TEA improve best-of-$N$ performance across different language models, reward models and datasets under various training and test-time budget settings.
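The gap between mean reward and best-of-$N$ reward can be illustrated with a small simulation. The sketch below is not the paper's TEA or Prefix-TEA estimator; it only shows the deployment rule and why the two objectives can rank policies differently. The two Gaussian reward distributions (`policy_a`, `policy_b`), the helper names, and the choice $N=16$ are all hypothetical, and rewards are sampled directly rather than produced by scoring model responses.

```python
import random
from statistics import fmean


def best_of_n(sample_reward, n, rng):
    """Deployment rule: draw n candidate rewards and keep the highest."""
    return max(sample_reward(rng) for _ in range(n))


def estimate(sample_reward, n, trials, seed):
    """Monte Carlo estimates of the mean reward and the expected best-of-n reward."""
    rng = random.Random(seed)
    mean_reward = fmean(sample_reward(rng) for _ in range(trials))
    bon_reward = fmean(best_of_n(sample_reward, n, rng) for _ in range(trials))
    return mean_reward, bon_reward


# Hypothetical per-prompt reward distributions (assumptions for illustration).
def policy_a(rng):
    # Higher mean, thin upper tail.
    return rng.gauss(0.5, 0.1)


def policy_b(rng):
    # Lower mean, heavier upper tail.
    return rng.gauss(0.4, 0.5)


mean_a, bon_a = estimate(policy_a, n=16, trials=5000, seed=0)
mean_b, bon_b = estimate(policy_b, n=16, trials=5000, seed=0)
print(f"policy A: mean={mean_a:.3f}  best-of-16={bon_a:.3f}")
print(f"policy B: mean={mean_b:.3f}  best-of-16={bon_b:.3f}")
```

Under these assumed distributions, policy A has the higher mean reward while policy B has the higher best-of-16 reward, so an objective that optimizes the single-response mean would prefer the policy that performs worse at deployment.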
