Test-time scaling (TTS) has gained widespread attention for enhancing LLM reasoning. Existing approaches such as Best-of-N and majority voting are limited because they can only select among candidate responses: their performance depends on candidate quality, and they cannot produce a correct solution when all candidates are incorrect. Parallel self-refinement, which generates multiple candidates and synthesizes a refined answer conditioned on them, offers a promising alternative, but the mechanism driving its effectiveness remains poorly understood. To bridge this gap, we introduce a new metric, the Refinement Gap, designed to quantify the relative improvement of self-refinement over majority voting. We show that the Refinement Gap exhibits a clear scaling trend with model size and is only weakly correlated with base capability. Based on this finding, we propose Generative Self-Refinement (GSR), a parallel test-time scaling framework that transfers the refinement policy from larger teacher models with a higher Refinement Gap to smaller students. Crucially, GSR jointly trains a single model both to generate strong candidates and to refine them into a better final answer. Experimental results demonstrate that our method achieves state-of-the-art performance among parallel aggregation methods across five mathematical benchmarks, while the learned refinement skill transfers across multiple model scales and families and generalizes robustly to an out-of-distribution domain.
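To make the comparison between majority voting and self-refinement concrete, the following is a minimal sketch of how such a gap could be computed on a small evaluation set. The abstract does not state the exact formula for the Refinement Gap, so the definition used here (refined accuracy minus majority-vote accuracy, divided by majority-vote accuracy) is an assumption made purely for illustration. The function names `majority_vote`, `accuracy`, and `refinement_gap` and the toy data are likewise hypothetical.

```python
from collections import Counter
from typing import Sequence


def majority_vote(candidates: Sequence[str]) -> str:
    """Return the most frequent candidate answer (ties go to the first seen)."""
    return Counter(candidates).most_common(1)[0][0]


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their reference answer."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def refinement_gap(
    candidate_sets: Sequence[Sequence[str]],
    refined_answers: Sequence[str],
    references: Sequence[str],
) -> float:
    """Relative improvement of refined answers over majority voting.

    ASSUMPTION: the formula (acc_refined - acc_majority) / acc_majority is
    an illustrative choice; the paper's exact definition may differ.
    """
    majority_answers = [majority_vote(c) for c in candidate_sets]
    acc_majority = accuracy(majority_answers, references)
    acc_refined = accuracy(refined_answers, references)
    if acc_majority == 0:
        raise ValueError("majority-vote accuracy is zero; relative gap undefined")
    return (acc_refined - acc_majority) / acc_majority


# Toy data: four problems, three sampled candidates each.
# In the second problem every candidate is wrong, so no selection
# method can recover the reference answer "10".
candidate_sets = [["4", "4", "5"], ["7", "8", "9"], ["1", "2", "2"], ["3", "3", "3"]]
references = ["4", "10", "2", "3"]
# Hypothetical outputs of a refinement step conditioned on the candidates.
refined_answers = ["4", "10", "2", "3"]

gap = refinement_gap(candidate_sets, refined_answers, references)
```

In this toy setting majority voting answers three of four problems correctly, while the hypothetical refined answers are all correct, so the gap is positive. The second problem shows the case the abstract highlights: selection-based aggregation cannot succeed when every candidate is incorrect, whereas a synthesized answer still can.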