Uncertainty estimation is crucial to the reliability of safety-critical human and artificial intelligence (AI) interaction systems, particularly in healthcare engineering. However, a robust and general uncertainty measure for free-form answers has not been established for open-ended medical question-answering (QA) tasks, where generative inequality introduces many irrelevant words and sequences into the generated set used for uncertainty quantification (UQ), potentially biasing the estimate. This paper introduces Word-Sequence Entropy (WSE), a method that calibrates uncertainty at both the word and sequence levels according to semantic relevance, so that the quantified uncertainty aligns more closely with the actual reliability of large language models (LLMs). We compare WSE with six baseline methods on five free-form medical QA datasets, using seven popular LLMs. Experimental results demonstrate that WSE achieves superior UQ performance under two standard criteria for correctness evaluation. Moreover, in real-world medical QA applications, selecting the responses that WSE identifies as least uncertain as final answers significantly improves model performance (e.g., a 6.36% accuracy gain on the COVID-QA dataset), without any additional task-specific fine-tuning or architectural modification.
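The core idea of weighting entropy by semantic relevance at both levels can be illustrated with a minimal toy sketch. This is not the authors' exact WSE formulation: the relevance scores are assumed to be given (in practice they would come from a semantic similarity model), and the aggregation shown here is a simple relevance-weighted average.

```python
def weighted_sequence_entropy(token_logprobs, token_relevance):
    """Word-level step (sketch): up-weight the negative log-probabilities of
    semantically relevant tokens and down-weight irrelevant ones, so filler
    words contribute less to the sequence's uncertainty."""
    assert len(token_logprobs) == len(token_relevance)
    total_rel = sum(token_relevance)
    # Relevance-normalized negative log-likelihood of one sampled answer.
    return sum(r * (-lp) for lp, r in zip(token_logprobs, token_relevance)) / total_rel

def word_sequence_entropy(samples):
    """Sequence-level step (sketch): average the per-sequence entropies,
    weighting each sampled answer by its relevance to the question, so
    off-topic generations contribute less to the final uncertainty."""
    num = sum(seq_rel * weighted_sequence_entropy(lps, rels)
              for lps, rels, seq_rel in samples)
    den = sum(seq_rel for _, _, seq_rel in samples)
    return num / den

# Two hypothetical sampled answers, each a tuple of
# (per-token log-probs, per-token relevance, sequence-level relevance).
samples = [
    ([-0.1, -0.5, -2.0], [1.0, 0.2, 1.0], 0.9),  # confident, on-topic sample
    ([-1.5, -1.2, -0.8], [0.5, 1.0, 0.3], 0.4),  # less confident, less relevant
]
print(round(word_sequence_entropy(samples), 4))  # → 1.0667
```

A higher value indicates greater calibrated uncertainty for the question; responses with lower scores would be preferred as final answers.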