The integration of large language models (LLMs) into medicine has attracted significant attention because of their promising accuracy in simulated clinical decision-making settings. Real clinical decision-making is more complex than these simulations, however, because physicians' decisions are shaped by many factors, including cognitive bias. The degree to which LLMs are susceptible to the same cognitive biases that affect human clinicians remains unexplored. We hypothesize that when LLMs are confronted with clinical questions containing cognitive biases, they will give significantly less accurate responses than when the same questions are presented without such biases. In this study, we developed BiasMedQA, a novel benchmark for evaluating cognitive biases in LLMs applied to medical tasks. Using BiasMedQA, we evaluated six LLMs: GPT-4, Mixtral-8x7B, GPT-3.5, PaLM-2, Llama 2 70B-chat, and the medically specialized PMC Llama 13B. We tested these models on 1,273 questions from the US Medical Licensing Exam (USMLE) Steps 1, 2, and 3, modified to replicate common clinically relevant cognitive biases. Our analysis revealed that the effect of bias varied across models: GPT-4 stood out for its resilience, whereas Llama 2 70B-chat and PMC Llama 13B were disproportionately affected. Our findings highlight the critical need for bias mitigation in the development of medical LLMs, pointing toward safer and more reliable applications in healthcare.
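The comparison behind the hypothesis, accuracy on unmodified questions versus accuracy on the same questions with a bias-inducing cue, can be illustrated with a minimal sketch. This is not the BiasMedQA implementation: the helper names (`add_bias`, `accuracy`, `accuracy_drop`), the way the cue is appended to the question, and the callable `model` interface are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def add_bias(question: str, bias_sentence: str) -> str:
    """Append a bias-inducing sentence to a question (hypothetical prompt format)."""
    return f"{question.rstrip()} {bias_sentence.strip()}"


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def accuracy_drop(
    model: Callable[[str], str],
    questions: Sequence[str],
    answers: Sequence[str],
    bias_sentence: str,
) -> float:
    """Baseline accuracy minus accuracy on the bias-modified questions.

    `model` is any callable mapping a prompt to an answer label; it stands in
    for an LLM call and is an assumption of this sketch.
    """
    baseline = accuracy([model(q) for q in questions], answers)
    biased = accuracy(
        [model(add_bias(q, bias_sentence)) for q in questions], answers
    )
    return baseline - biased
```

A positive return value from `accuracy_drop` would indicate that the model answered less accurately once the cue was added, which is the direction the hypothesis predicts.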