Large language models (LLMs) often offer clinical judgments based on incomplete information, increasing the risk of misdiagnosis. Existing studies have primarily evaluated confidence in single-turn, static settings, overlooking how confidence and correctness become coupled as clinical evidence accumulates during real consultations, which limits their support for reliable decision-making. We propose the first benchmark for assessing confidence in multi-turn interactions during realistic medical consultations. Our benchmark unifies three types of medical data for open-ended diagnostic generation and introduces an information sufficiency gradient to characterize confidence-correctness dynamics as evidence increases. We implement and compare 27 representative methods on this benchmark, from which two key insights emerge: (1) medical data amplifies the inherent limitations of token-level and consistency-level confidence methods, and (2) medical reasoning must be evaluated for both diagnostic accuracy and information completeness. Based on these insights, we present MedConf, an evidence-grounded linguistic self-assessment framework that constructs symptom profiles via retrieval-augmented generation, aligns patient information against them under supporting, missing, and contradictory relations, and aggregates these relations into an interpretable confidence estimate through weighted integration. Across two LLMs and three medical datasets, MedConf consistently outperforms state-of-the-art methods on both AUROC and Pearson correlation coefficient metrics, maintaining stable performance under information insufficiency and multimorbidity. These results demonstrate that information adequacy is a key determinant of credible medical confidence modeling, offering a new pathway toward more reliable and interpretable large medical models.
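The weighted-integration step described above can be illustrated with a minimal sketch. The function below is hypothetical (the abstract does not specify MedConf's actual weighting scheme): it assumes the supporting, missing, and contradictory relations have been counted for a candidate diagnosis, and combines them into a single confidence score in [0, 1].

```python
def medconf_score(support: int, missing: int, contradict: int,
                  w_s: float = 1.0, w_m: float = 0.5, w_c: float = 1.5) -> float:
    """Aggregate evidence-relation counts into a confidence estimate in [0, 1].

    support / missing / contradict: counts of patient findings that support
    the candidate diagnosis, are absent from the symptom profile, or
    contradict it. The weights are illustrative placeholders, not the
    values used by MedConf itself.
    """
    total = w_s * support + w_m * missing + w_c * contradict
    if total == 0:
        # No aligned evidence at all: report maximal uncertainty.
        return 0.5
    # Confidence is the weighted share of supporting evidence, so
    # missing and contradictory findings both pull the score down.
    return (w_s * support) / total
```

For example, a case with many supporting findings and no contradictions scores higher than one with the same support but several contradictions, mirroring the intuition that confidence should fall as conflicting evidence accumulates.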