Test-time scaling methods improve the capabilities of large language models (LLMs) by increasing the amount of compute used during inference to make a prediction. Inference-time compute can be scaled in parallel by choosing among multiple independent solutions or sequentially through self-refinement. We propose Recursive Self-Aggregation (RSA), a test-time scaling method inspired by evolutionary methods that combines the benefits of both parallel and sequential scaling. Each step of RSA refines a population of candidate reasoning chains through aggregation of subsets to yield a population of improved solutions, which are then used as the candidate pool for the next iteration. Empirically, RSA delivers substantial performance gains with increasing compute budgets across diverse tasks, model families and sizes. Notably, RSA with Gemini 3 Flash attains performance near the top of the ARC-AGI-2 public leaderboard. RSA also enables Qwen3-4B-Instruct-2507 to achieve competitive performance with larger reasoning models, including DeepSeek-R1 and o3-mini (high), outperforming purely parallel and sequential scaling strategies across AIME-25, HMMT-25, Reasoning Gym, LiveCodeBench-v6, and SuperGPQA. We further propose a novel aggregation-aware reinforcement learning approach that yields significant performance gains by training the model to combine solutions.
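The refinement loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `generate`, `aggregate`, `population_size`, `subset_size`, and `num_steps` are hypothetical, the two callables stand in for LLM calls, and drawing each subset uniformly at random without replacement is an assumption, since the abstract does not say how subsets are chosen.

```python
import random
from typing import Callable, List, TypeVar

C = TypeVar("C")  # a candidate solution, e.g. a reasoning chain


def recursive_self_aggregation(
    generate: Callable[[], C],
    aggregate: Callable[[List[C]], C],
    population_size: int,
    subset_size: int,
    num_steps: int,
    rng: random.Random,
) -> List[C]:
    """Sketch of the RSA loop; all parameter names are illustrative."""
    if not 1 <= subset_size <= population_size:
        raise ValueError("subset_size must be between 1 and population_size")

    # Parallel scaling: start from independently generated candidates.
    population = [generate() for _ in range(population_size)]

    # Sequential scaling: each step builds a new population of the same
    # size, where every member is an aggregation of a subset of the
    # previous population (uniform sampling is an assumption here).
    for _ in range(num_steps):
        population = [
            aggregate(rng.sample(population, subset_size))
            for _ in range(population_size)
        ]
    return population
```

In practice `generate` would sample a reasoning chain for the question and `aggregate` would prompt the same model to combine the sampled chains into an improved one. How the final answer is picked from the returned population is left outside this sketch.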