Confidence in LLMs is a useful indicator of model uncertainty and answer reliability. Existing work has mainly focused on single-turn scenarios, while research on confidence in complex multi-turn interactions remains limited. In this paper, we investigate whether LLM-based search agents can communicate their own confidence through verbalized confidence scores after long sequences of actions, a significantly more challenging task than reporting confidence in a single interaction. Experimenting with open-source agentic models, we first find that models exhibit much higher task accuracy at high confidence and near-zero accuracy when confidence is low. Based on this observation, we propose Test-Time Scaling (TTS) methods that use confidence scores to assess answer quality and encourage the model to retry until a satisfactory confidence level is reached. Results show that our proposed methods significantly reduce token consumption while achieving performance competitive with baseline fixed-budget TTS methods.
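To make the confidence-gated retry idea concrete, below is a minimal Python sketch of one plausible realization. All names here (run_agent, CONFIDENCE_THRESHOLD, MAX_ATTEMPTS) are illustrative assumptions rather than the paper's actual implementation; the agent rollout itself is treated as a black box that returns an answer together with a verbalized confidence in [0, 1].

```python
# Minimal sketch of confidence-gated test-time scaling (TTS).
# Assumption: `run_agent` executes one full multi-turn search-agent rollout
# and returns (answer, confidence) with confidence in [0, 1].

from typing import Callable, Tuple

CONFIDENCE_THRESHOLD = 0.8   # accept answers at or above this verbalized confidence (assumed value)
MAX_ATTEMPTS = 4             # cap on retries so token cost stays bounded (assumed value)

def confidence_gated_tts(
    run_agent: Callable[[str], Tuple[str, float]],
    question: str,
) -> Tuple[str, float]:
    """Re-run the agent until its verbalized confidence clears the threshold."""
    best_answer, best_conf = "", -1.0
    for _ in range(MAX_ATTEMPTS):
        answer, conf = run_agent(question)
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        if conf >= CONFIDENCE_THRESHOLD:
            break  # early exit saves tokens relative to a fixed retry budget
    return best_answer, best_conf
```

The early break is what distinguishes this from a fixed-budget baseline: high-confidence answers terminate the loop immediately, so extra rollouts are spent only on low-confidence (and empirically low-accuracy) cases.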