Large Language Models (LLMs) have difficulty communicating uncertainty, which is a significant obstacle to applying them to complex medical tasks. This study evaluates methods for measuring LLM confidence when the model suggests a diagnosis for challenging clinical vignettes. GPT4 was asked a series of challenging case questions using Chain of Thought (CoT) and Self Consistency (SC) prompting. Several methods of assessing model confidence were investigated and evaluated on their ability to predict the model's observed accuracy: Intrinsic Confidence, SC Agreement Frequency, and CoT Response Length. SC Agreement Frequency correlated with observed accuracy and yielded a higher Area Under the Receiver Operating Characteristic Curve (AUROC) than Intrinsic Confidence or CoT Response Length. Intrinsic Confidence and CoT Response Length were weaker at differentiating correct from incorrect answers, which prevents them from serving as reliable and interpretable markers of model confidence. We conclude that GPT4 has a limited ability to assess its own diagnostic accuracy, and that SC Agreement Frequency is the most useful of the evaluated methods for measuring GPT4 confidence, especially for medical diagnosis.
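As a rough illustration of how SC Agreement Frequency can be scored against observed accuracy, the sketch below computes, for each case, the share of sampled answers that match the majority answer, and then an AUROC over cases. It is a minimal sketch under stated assumptions, not the study's implementation: the function names `sc_agreement` and `auroc`, the string normalization, the exact-match correctness check, and the four example cases are all hypothetical, since the abstract does not say how answers were matched or graded.

```python
from collections import Counter


def sc_agreement(answers):
    """Return (majority answer, fraction of samples agreeing with it).

    Assumption: answers are compared after trimming and lowercasing.
    """
    counts = Counter(a.strip().lower() for a in answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


def auroc(scores, labels):
    """Probability that a randomly chosen correct case has a higher score
    than a randomly chosen incorrect case (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: five sampled diagnoses per case plus the reference answer.
cases = [
    (["sarcoidosis", "sarcoidosis", "Sarcoidosis", "tuberculosis", "sarcoidosis"], "sarcoidosis"),
    (["lupus", "lyme disease", "lupus", "vasculitis", "lupus"], "vasculitis"),
    (["gout", "gout", "gout", "gout", "gout"], "gout"),
    (["amyloidosis", "myeloma", "amyloidosis", "lymphoma", "sarcoidosis"], "amyloidosis"),
]

scores, labels = [], []
for answers, truth in cases:
    majority, agreement = sc_agreement(answers)
    scores.append(agreement)           # confidence proxy for this case
    labels.append(majority == truth)   # was the majority answer correct?

print("agreement per case:", scores)
print("AUROC:", auroc(scores, labels))
```

The same `auroc` function could be applied to the other two signals (a self-reported confidence value or the length of the CoT response) to compare them on the same set of cases.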