How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We instead perform reinforcement learning at test time, so the LLM can continue to train, now with experience specific to the test problem. This form of continual learning is unusual: its goal is to produce one great solution rather than many good ones on average, and to solve this particular problem rather than to generalize to others. Our learning objective and search subroutine are therefore designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets a new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions were reviewed by experts or the competition organizers. All results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, at a cost of only a few hundred dollars per problem.