Scaling test-time compute via parallel sampling can substantially improve LLM reasoning, but it is often limited by the quality of Best-of-N selection. Generative selection methods, such as GenSelect, address this bottleneck, yet strong selection performance remains largely confined to large models. We show that small reasoning models can acquire strong GenSelect capabilities through targeted reinforcement learning. To this end, we synthesize selection tasks from large-scale math and code instruction datasets by filtering to instances with both correct and incorrect candidate solutions, and we train 1.7B-parameter models with DAPO to reward correct selections. Across math (AIME24, AIME25, HMMT25) and code (LiveCodeBench) reasoning benchmarks, our models consistently outperform prompting and majority-voting baselines, often approaching or exceeding the performance of much larger models. Moreover, these gains generalize to selecting outputs from stronger models, despite training only on outputs from weaker models. Overall, our results establish reinforcement learning as a scalable way to unlock strong generative selection in small models, enabling efficient test-time scaling.
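The task-synthesis and reward pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names (`make_selection_task`, `selection_reward`), prompt format, and data layout are all assumptions. It shows the two key ideas from the abstract: keeping only problems whose candidate pool mixes correct and incorrect solutions, and giving a binary RL reward for selecting a correct candidate.

```python
import random

def make_selection_task(problem, candidates):
    """Build a GenSelect-style selection instance (hypothetical format).

    candidates: list of (solution_text, is_correct) pairs, e.g. sampled
    from a weaker model and graded against ground truth.
    """
    labels = [ok for _, ok in candidates]
    # Filter step from the abstract: a selection task is only informative
    # when the pool contains BOTH correct and incorrect solutions.
    if not (any(labels) and not all(labels)):
        return None
    candidates = list(candidates)
    random.shuffle(candidates)  # avoid positional bias
    prompt = (
        problem
        + "\n\nCandidate solutions:\n"
        + "\n".join(f"[{i}] {sol}" for i, (sol, _) in enumerate(candidates))
        + "\n\nSelect the index of the best solution."
    )
    return {
        "prompt": prompt,
        "correct_indices": [i for i, (_, ok) in enumerate(candidates) if ok],
    }

def selection_reward(task, chosen_index):
    # Binary reward used for RL training (e.g. with DAPO):
    # 1.0 iff the model selected a correct candidate.
    return 1.0 if chosen_index in task["correct_indices"] else 0.0
```

Under this sketch, problems where every sampled solution is correct (or every one is wrong) are dropped, so the policy is trained only on instances where selection actually matters.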