Large Language Models (LLMs) have emerged as powerful candidates to inform clinical decision-making processes. While these models play an increasingly prominent role in shaping the digital landscape, two concerns are growing in healthcare applications: 1) to what extent do LLMs exhibit social bias based on patients' protected attributes (like race), and 2) how do design choices (like architecture design and prompting strategies) influence the observed biases? To answer these questions rigorously, we evaluate eight popular LLMs across three question-answering (QA) datasets using clinical vignettes (patient descriptions) standardized for bias evaluations. We employ red-teaming strategies to analyze how demographics affect LLM outputs, comparing general-purpose and clinically trained models. Our extensive experiments reveal various disparities (some significant) across protected groups. We also observe several counter-intuitive patterns: larger models are not necessarily less biased, and models fine-tuned on medical data are not necessarily better than general-purpose models. Furthermore, our study demonstrates the impact of prompt design on bias patterns: specific phrasing can influence them, and reflection-type approaches (like Chain of Thought) can effectively reduce biased outcomes. Consistent with prior studies, we call for additional evaluation, scrutiny, and enhancement of LLMs used in clinical decision support applications.