While confidence estimation is a promising direction for mitigating hallucinations in Large Language Models (LLMs), current research focuses predominantly on single-turn settings. The dynamics of model confidence in multi-turn conversations, where context accumulates and ambiguity is progressively resolved, remain largely unexplored. Reliable confidence estimation in multi-turn settings is critical for many downstream applications, such as autonomous agents and human-in-the-loop systems. This work presents the first systematic study of confidence estimation in multi-turn interactions, establishing a formal evaluation framework grounded in two key desiderata: per-turn calibration and monotonicity of confidence as more information becomes available. To support this framework, we introduce novel metrics, including a length-normalized Expected Calibration Error (InfoECE), and a new "Hinter-Guesser" paradigm for generating controlled evaluation datasets. Our experiments reveal that widely used confidence estimation techniques struggle with both calibration and monotonicity in multi-turn dialogues. We propose P(Sufficient), a logit-based probe that achieves comparatively better performance, although the task remains far from solved. Our work provides a foundational methodology for developing more reliable and trustworthy conversational agents.
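To make the two desiderata concrete, the following is a minimal illustrative sketch, not the paper's implementation: it computes a standard binned Expected Calibration Error per turn (the length-normalized InfoECE variant is not reproduced here) and reports mean confidence per turn, which can be inspected for monotonicity. All names (`records`, `per_turn_report`, the binning choice) are hypothetical and assumed for illustration only.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Standard binned ECE: bin-weighted gap between mean confidence and accuracy."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

def per_turn_report(records):
    """records: iterable of (turn_index, confidence, is_correct) triples.

    Prints per-turn ECE (calibration) and mean confidence; under the
    monotonicity desideratum, mean confidence should not decrease as
    the turn index grows and more disambiguating context is available.
    """
    records = list(records)
    for t in sorted({t for t, _, _ in records}):
        conf = [c for ti, c, _ in records if ti == t]
        corr = [y for ti, _, y in records if ti == t]
        print(f"turn {t}: ECE={expected_calibration_error(conf, corr):.3f}, "
              f"mean confidence={np.mean(conf):.3f}")
```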