Addressing cognitive bias in medical language models

There is increasing interest in the application of large language models (LLMs) to the medical field, in part because of their impressive performance on medical exam questions. While promising, exam questions do not reflect the complexity of real patient-doctor interactions. In reality, physicians' decisions are shaped by many complex factors, such as patient compliance, personal experience, ethical beliefs, and cognitive bias. As a step toward understanding this, we hypothesize that when LLMs are confronted with clinical questions containing cognitive biases, they will yield significantly less accurate responses than when given the same questions without such biases. In this study, we developed BiasMedQA, a benchmark for evaluating cognitive biases in LLMs applied to medical tasks. Using BiasMedQA, we evaluated six LLMs: GPT-4, Mixtral-8x70B, GPT-3.5, PaLM-2, Llama 2 70B-chat, and the medically specialized PMC Llama 13B. We tested these models on 1,273 questions from the US Medical Licensing Exam (USMLE) Steps 1, 2, and 3, modified to replicate common clinically relevant cognitive biases. Our analysis revealed varying effects of biases on these LLMs, with GPT-4 standing out for its resilience to bias, in contrast to Llama 2 70B-chat and PMC Llama 13B, which were disproportionately affected by cognitive bias. Our findings highlight the critical need for bias mitigation in the development of medical LLMs, pointing towards safer and more reliable applications in healthcare.
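The comparison behind the hypothesis, accuracy on the original questions versus accuracy on the same questions with a bias-inducing sentence added, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the BiasMedQA implementation: the `Question` type, the helper names, the wording of the injected sentence, and the toy `gullible_model` are all hypothetical and do not come from the study.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class Question:
    """A multiple-choice question (hypothetical structure for this sketch)."""
    stem: str
    options: Dict[str, str]  # option letter -> option text
    answer: str              # letter of the correct option


# A "model" here is any callable that maps a question to an option letter.
Model = Callable[[Question], str]


def inject_bias(q: Question, bias_sentence: str) -> Question:
    """Return a copy of the question with a bias-inducing sentence appended."""
    return Question(stem=f"{q.stem} {bias_sentence}", options=q.options, answer=q.answer)


def accuracy(model: Model, questions: Sequence[Question]) -> float:
    """Fraction of questions the model answers correctly."""
    if not questions:
        raise ValueError("need at least one question")
    return sum(model(q) == q.answer for q in questions) / len(questions)


def accuracy_drop(
    model: Model,
    questions: Sequence[Question],
    make_bias: Callable[[Question], str],
) -> Tuple[float, float, float]:
    """Return (baseline accuracy, biased accuracy, baseline minus biased)."""
    baseline = accuracy(model, questions)
    biased = accuracy(model, [inject_bias(q, make_bias(q)) for q in questions])
    return baseline, biased, baseline - biased


def suggest_wrong_option(q: Question) -> str:
    """Assumed bias template: a sentence that points to an incorrect option."""
    wrong = next(letter for letter in sorted(q.options) if letter != q.answer)
    return f"Most of your colleagues believe the answer is {wrong}."


def gullible_model(q: Question) -> str:
    """Toy stand-in for an LLM: follows any suggested answer, else answers correctly."""
    marker = "believe the answer is "
    if marker in q.stem:
        return q.stem.split(marker)[1][0]
    return q.answer
```

In a real evaluation, `gullible_model` would be replaced by a call to the LLM under test, and `make_bias` by one template per cognitive bias being studied; a model that is resilient to bias would show a drop close to zero.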

