Autonomous agents based on large language models (LLMs) are rapidly evolving to handle multi-turn tasks, but ensuring their trustworthiness remains a critical challenge. A fundamental pillar of this trustworthiness is calibration, which refers to an agent's ability to express confidence that reliably reflects its actual performance. While calibration is well-established for static models, its dynamics in tool-integrated agentic workflows remain underexplored. In this work, we systematically investigate verbalized calibration in tool-use agents, revealing a fundamental confidence dichotomy driven by tool type. Specifically, our pilot study identifies that evidence tools (e.g., web search) systematically induce severe overconfidence due to inherent noise in retrieved information, while verification tools (e.g., code interpreters) can ground reasoning through deterministic feedback and mitigate miscalibration. To robustly improve calibration across tool types, we propose a reinforcement learning (RL) fine-tuning framework that jointly optimizes task accuracy and calibration, supported by a holistic benchmark of reward designs. We demonstrate that our trained agents not only achieve superior calibration but also exhibit robust generalization from local training environments to noisy web settings and to distinct domains such as mathematical reasoning. Our results highlight the necessity of domain-specific calibration strategies for tool-use agents. More broadly, this work establishes a foundation for building self-aware agents that can reliably communicate uncertainty in high-stakes, real-world deployments.
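As a concrete illustration of the quantities discussed above — verbalized calibration and a reward that jointly scores accuracy and calibration — the following is a minimal sketch. It uses the Brier score as the calibration measure and a simple accuracy-minus-penalty reward; the paper's actual reward designs are benchmarked in the work itself, so the functions and weighting here are illustrative assumptions, not the proposed method.

```python
def brier_score(confidences, correct):
    """Mean squared gap between verbalized confidence in [0, 1] and the
    binary task outcome (1 = correct, 0 = incorrect); lower is better."""
    assert len(confidences) == len(correct)
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)

def joint_reward(is_correct, confidence, calib_weight=0.5):
    """Hypothetical per-episode RL reward: an accuracy term minus a
    weighted calibration penalty (squared confidence-outcome gap)."""
    y = 1.0 if is_correct else 0.0
    return y - calib_weight * (confidence - y) ** 2

# An agent saying 0.9 on a correct answer earns more than an
# overconfident agent saying 0.9 on a wrong one.
r_good = joint_reward(True, 0.9)   # 1.0 - 0.5 * 0.01
r_bad = joint_reward(False, 0.9)   # 0.0 - 0.5 * 0.81
```

Under a reward of this shape, overconfidence on failures (e.g., after noisy web search) is directly penalized, which is one way a joint objective can push agents toward confidence estimates that track actual performance.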