Large language models (LLMs) are approaching expert-level performance in medical question answering (QA), demonstrating strong potential to improve public healthcare. However, underlying biases related to sensitive attributes such as sex and race pose life-critical risks. The extent to which such sensitive attributes affect diagnosis remains an open question and requires comprehensive empirical investigation. Moreover, even the latest Counterfactual Patient Variations (CPV) benchmark struggles to distinguish the bias levels of different LLMs. To explore these dynamics further, we propose a new benchmark, FairMedQA, and evaluate 12 representative LLMs on it. FairMedQA contains 4,806 counterfactual question pairs constructed from 801 clinical vignettes. Our results reveal substantial accuracy disparities, ranging from 3 to 19 percentage points, across sensitive demographic groups. Notably, FairMedQA exposes biases at least 12 percentage points larger than those identified by the latest CPV benchmark, demonstrating superior benchmarking sensitivity. These findings underscore an urgent need for targeted debiasing techniques and for more rigorous, identity-aware validation protocols before LLMs can be safely integrated into practical clinical decision-support systems.