As large language models (LLMs) increasingly tackle complex reasoning tasks, test-time scaling has become critical for enhancing their capabilities. However, in agentic scenarios with frequent tool calls, the traditional generation-length-based definition of test time breaks down: tool latency decouples inference time from generation length. We propose Timely Machine, which redefines test time as wall-clock time and requires models to dynamically adjust their strategies based on a time budget. We introduce Timely-Eval, a benchmark spanning high-frequency tool calls, low-frequency tool calls, and time-constrained reasoning. By varying tool latency, we find that smaller models excel under fast feedback by making more interactions, while larger models dominate high-latency settings through superior interaction quality. Moreover, existing models fail to adapt their reasoning to time budgets. We propose Timely-RL to address this gap: after cold-start supervised fine-tuning, we apply reinforcement learning to strengthen temporal planning. Timely-RL improves time-budget awareness and consistently boosts performance across Timely-Eval. We hope our work offers a new perspective on test-time scaling for the agentic era.