Test-time scaling has become a standard way to improve the performance and reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well understood: small per-step errors can compound over long horizons, and we find that naive policies that uniformly increase sampling show diminishing returns. In this work, we present CATTS, a simple technique for dynamically allocating compute to multi-step agents. We first conduct an empirical study of inference-time scaling for web agents and find that uniformly increasing per-step compute quickly saturates in long-horizon environments. We then investigate stronger aggregation strategies, including an LLM-based Arbiter that can outperform naive voting but may overrule high-consensus decisions. We show that uncertainty statistics derived from the agent's own vote distribution (entropy and the top-1/top-2 margin) correlate with downstream success and provide a practical signal for dynamic compute allocation. Based on these findings, we introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate additional compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over ReAct while using up to 2.3x fewer tokens than uniform scaling, providing both efficiency gains and an interpretable decision rule.
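The uncertainty statistics described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, thresholds, and sample budgets are hypothetical placeholders for how vote entropy and the top-1/top-2 margin might gate per-step compute.

```python
from collections import Counter
import math

def vote_uncertainty(votes):
    """Entropy and top-1/top-2 margin of a vote distribution.

    `votes` is a list of candidate actions sampled from the agent
    at one step; identical strings are counted as the same vote.
    """
    counts = Counter(votes)
    total = len(votes)
    # Probabilities sorted from most- to least-voted candidate.
    probs = sorted((c / total for c in counts.values()), reverse=True)
    entropy = -sum(p * math.log(p) for p in probs)
    margin = probs[0] - (probs[1] if len(probs) > 1 else 0.0)
    return entropy, margin

def allocate_samples(votes, entropy_thresh=0.6, margin_thresh=0.4,
                     base_k=5, escalated_k=15):
    """Escalate sampling only when the vote is contentious.

    Thresholds and budgets here are illustrative assumptions,
    not values from the CATTS paper.
    """
    entropy, margin = vote_uncertainty(votes)
    contentious = entropy > entropy_thresh or margin < margin_thresh
    return escalated_k if contentious else base_k
```

A unanimous vote (zero entropy, margin 1.0) keeps the base budget, while a split vote triggers the escalated budget; this is the interpretable decision rule the abstract refers to.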