Reliable uncertainty quantification (UQ) is essential when employing large language models (LLMs) in high-risk domains such as clinical question answering (QA). In this work, we evaluate uncertainty estimation methods for clinical QA, focusing for the first time on eleven clinical specialties and six question types, across ten open-source LLMs (general-purpose, biomedical, and reasoning models) alongside representative proprietary models. We analyze score-based UQ methods, present a case study introducing a novel lightweight method based on behavioral features derived from reasoning-oriented models, and examine conformal prediction as a complementary set-based approach. Our findings reveal that uncertainty reliability is not a monolithic property but one that depends on clinical specialty and question type, owing to shifts in calibration and discrimination. These results highlight the need to select or ensemble models according to their distinct, complementary strengths and the intended clinical use.
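The set-based approach mentioned above can be illustrated with a minimal sketch of standard split conformal prediction for multiple-choice QA. This is not the paper's implementation; the function name, the choice of nonconformity score (one minus the probability of the true answer), and the array shapes are illustrative assumptions.

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction for multiple-choice QA (illustrative sketch).

    cal_probs:  (n, k) model answer probabilities on a held-out calibration set
    cal_labels: (n,)   indices of the correct answers
    test_probs: (m, k) answer probabilities on new questions
    Returns a boolean (m, k) mask: prediction sets that contain the true
    answer with marginal probability >= 1 - alpha.
    """
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true answer.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration scores.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q_hat = np.quantile(scores, level, method="higher")
    # Include every answer option whose score is within the threshold.
    return (1.0 - test_probs) <= q_hat
```

Larger prediction sets then signal higher model uncertainty on a given question, which is the sense in which this method complements scalar confidence scores.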