The Diligent Learner framework suggests LLMs can achieve superintelligence via test-time search, provided a sufficiently high step-success probability $γ$. In this work, we design a benchmark to measure $γ$ on logical out-of-distribution inference. We construct a class of tasks involving GF(2) circuit reconstruction that grow more difficult with each reasoning step, and that are, from an information-theoretic standpoint, impossible to solve reliably unless the LLM carefully integrates all of the information provided. Our analysis demonstrates that while the $γ$ value for small LLMs declines superlinearly as depth increases, frontier models exhibit partial robustness on this task. Furthermore, we find that successful reasoning at scale is contingent upon precise tool calls, identifying tool design as a critical capability for LLMs to achieve general superintelligence through the Diligent Learner framework.
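To make the task family concrete, here is a minimal sketch (our own illustration, not the paper's actual benchmark) of a GF(2) circuit-reconstruction problem in its simplest, linear form: a hidden circuit outputs the XOR of a secret subset of $n$ input bits, and the solver must recover that subset from input/output observations. Reliable recovery requires the observed inputs to span $\mathrm{GF}(2)^n$, so the solver must integrate all of the provided information; the names `evaluate` and `reconstruct` are illustrative.

```python
# Hypothetical sketch of a linear GF(2) circuit-reconstruction task.
# The hidden circuit is a secret bitmask; its output on input x is the
# XOR (parity) of the input bits selected by the mask.
import random

def evaluate(secret_mask, x):
    """Output bit of the hidden circuit: XOR of inputs selected by secret_mask."""
    return bin(secret_mask & x).count("1") & 1

def reconstruct(n, samples):
    """Recover the secret mask from (input, output) pairs by Gaussian
    elimination over GF(2), assuming the inputs span GF(2)^n."""
    pivot_rows = {}  # pivot bit -> (row_mask, row_output)
    for x, y in samples:
        cur_x, cur_y = x, y
        while cur_x:
            bit = cur_x.bit_length() - 1
            if bit not in pivot_rows:
                pivot_rows[bit] = (cur_x, cur_y)
                break
            px, py = pivot_rows[bit]
            cur_x ^= px  # XOR-reduce the highest set bit
            cur_y ^= py
    # Back-substitute, fixing bits from the lowest pivot to the highest.
    mask = 0
    for bit in sorted(pivot_rows):
        px, py = pivot_rows[bit]
        known = bin((px ^ (1 << bit)) & mask).count("1") & 1
        if py ^ known:
            mask |= 1 << bit
    return mask

if __name__ == "__main__":
    rng = random.Random(0)
    n = 16
    secret = rng.getrandbits(n) | 1  # ensure a non-trivial circuit
    samples = [(x, evaluate(secret, x))
               for x in (rng.getrandbits(n) for _ in range(4 * n))]
    assert reconstruct(n, samples) == secret
```

The information-theoretic point is visible even in this toy: with fewer than $n$ linearly independent observations the system is underdetermined and no solver can succeed reliably, so each additional reasoning step (e.g. deeper, non-linear circuits) tightens the requirement that every observation be used correctly.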