LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

Tong Zheng, Haolin Liu, Chengsong Huang, Huiwen Bao, Sheng Zhang, Rui Liu, Runpeng Dai, Ruibo Chen, Chenxi Liu, Tianyi Xiong, Xidong Wu, Hongming Zhang, Heng Huang

from arXiv, 25 pages

Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation-allocation space unexplored. We propose an environment-driven framework, AutoTTS, that changes what researchers design: from individual TTS heuristics to environments where TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search. As a concrete instantiation, we formulate width-depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, where controllers decide when to branch, continue, probe, prune, or stop and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable, and fine-grained execution-trace feedback to improve discovery efficiency by helping the agent diagnose why a TTS program fails. Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy-cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held-out benchmarks and model scales, while the entire discovery process costs only $39.9 and 160 minutes. Our data and code will be open-sourced at https://github.com/zhengkid/AutoTTS.
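To make the offline-evaluation idea concrete, here is a minimal Python sketch of a controller replayed over pre-collected trajectories. It is not the AutoTTS implementation. Every name in it (`Trajectory`, `make_controller`, `replay`, `evaluate`) is hypothetical, the agreement-threshold rule is an assumed stand-in for a discovered strategy, and pruning and beta parameterization are left out because the abstract does not specify them.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Trajectory:
    """One cached reasoning trace (hypothetical format, assumed non-empty)."""
    probe_answers: List[str]  # cached probe result after each reasoning step
    tokens_per_step: int      # token cost charged for advancing one step


Controller = Callable[[List[str], int], str]


def make_controller(agree: float, max_width: int) -> Controller:
    """Assumed toy rule: stop on agreement, else widen, else go deeper."""
    def controller(votes: List[str], width: int) -> str:
        top_share = Counter(votes).most_common(1)[0][1] / len(votes)
        if top_share >= agree:
            return "stop"
        if width < max_width:
            return "branch"
        return "continue"
    return controller


def replay(pool: List[Trajectory], controller: Controller,
           start_width: int = 2) -> Tuple[str, int]:
    """Replay cached traces under a controller; no LLM call is made."""
    depths = {i: 0 for i in range(min(start_width, len(pool)))}
    cost = 0
    while True:
        # Continue: advance every unfinished active branch by one cached step.
        for i in depths:
            if depths[i] < len(pool[i].probe_answers):
                depths[i] += 1
                cost += pool[i].tokens_per_step
        # Probe: read each branch's cached intermediate answer.
        votes = [pool[i].probe_answers[depths[i] - 1] for i in depths]
        exhausted = all(depths[i] == len(pool[i].probe_answers) for i in depths)
        action = controller(votes, len(depths))
        if action == "stop":
            break
        if action == "branch" and len(depths) < len(pool):
            depths[len(depths)] = 0  # activate the next cached trace
            continue
        if exhausted:
            break  # nothing left to advance or add
    answer = Counter(votes).most_common(1)[0][0]
    return answer, cost


def evaluate(dataset: List[Tuple[List[Trajectory], str]],
             controller: Controller) -> Tuple[float, float]:
    """Return (accuracy, mean token cost) of a controller on cached data."""
    correct, total_cost = 0, 0
    for pool, gold in dataset:
        answer, cost = replay(pool, controller)
        correct += int(answer == gold)
        total_cost += cost
    return correct / len(dataset), total_cost / len(dataset)


# Toy cached pool for one problem whose reference answer is "4".
toy_pool = [
    Trajectory(["3", "4", "4"], tokens_per_step=10),
    Trajectory(["5", "4", "4"], tokens_per_step=10),
    Trajectory(["4", "4", "4"], tokens_per_step=10),
]
```

Because `replay` only reads cached answers, scoring a candidate controller with `evaluate([(toy_pool, "4")], make_controller(0.75, 3))` takes negligible time, which is the property that lets a search loop test many candidate strategies cheaply.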
