Large language models are increasingly deployed with test-time strategies: sample $N$ responses, score them with a reward model or verifier, and return the best. This deployment rule exposes a mismatch in post-training: standard objectives optimize the mean reward of a single response, whereas best-of-$N$ performance is governed by the upper tail of the reward distribution. Recent test-time-aware objectives partly address this mismatch, but typically assume that training can use the same per-prompt rollout budget as deployment, which is impractical when post-training must cover many prompts while deployment can allocate much larger per-prompt test-time compute. We study this budget-mismatch regime, where only $m\ll N$ per-prompt rollouts are available during training but the target objective is best-of-$N$ deployment. Under structural assumptions on the reward tails, we show that the policy gradient of the best-of-$N$ objective can be approximated from a much smaller rollout group by extrapolating upper-tail statistics. This yields a family of Tail-Extrapolated estimators for best-of-$N$-oriented post-training: a simple direct estimator, Tail-Extrapolated Advantage (TEA), and a fixed-order debiased Prefix-TEA estimator based on moment cancellation. Experiments on instruction-following tasks show that TEA and Prefix-TEA improve best-of-$N$ performance across different language models, reward models and datasets under various training and test-time budget settings.
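To make the mean-versus-tail mismatch concrete, here is a minimal sketch, not the paper's TEA or Prefix-TEA estimator. It shows the best-of-$N$ selection rule and a standard order-statistic estimate of the expected best-of-$k$ reward from $m$ sampled rewards. The function names `best_of_n` and `expected_best_of_k` are hypothetical, and rewards are assumed to be i.i.d. scalar scores for a single prompt.

```python
import math


def best_of_n(responses, score):
    """Deployment rule: return the response with the highest score."""
    return max(responses, key=score)


def expected_best_of_k(rewards, k):
    """Estimate E[max of k i.i.d. rewards] from m >= k sampled rewards.

    Averages the maximum over all size-k subsets of the m samples,
    computed in closed form with order-statistic weights.
    """
    m = len(rewards)
    if not 1 <= k <= m:
        # For k > m there is no subset to average over; this is the
        # budget-mismatch regime, where some form of extrapolation is needed.
        raise ValueError("need 1 <= k <= m")
    ordered = sorted(rewards)
    # The reward at sorted index i (0-based) is the maximum of exactly
    # C(i, k - 1) of the C(m, k) subsets of size k.
    weighted = sum(r * math.comb(i, k - 1) for i, r in enumerate(ordered))
    return weighted / math.comb(m, k)


if __name__ == "__main__":
    rewards = [1.0, 2.0, 3.0, 4.0]  # illustrative scores, not real data
    print(expected_best_of_k(rewards, 1))  # k = 1 recovers the mean reward
    print(expected_best_of_k(rewards, 2))  # k = 2 weights the upper tail more
```

With $k = 1$ the estimate reduces to the mean reward that standard objectives optimize, while larger $k$ shifts weight toward the top-ranked samples. The subset-averaging estimate is only defined for $k \le m$, so it cannot target best-of-$N$ directly when $m \ll N$.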