Large language models (LLMs) are increasingly embedded in healthcare workflows for documentation, education, and clinical decision support. However, these systems are trained on large text corpora that encode existing biases, including sex disparities in diagnosis and treatment, raising concerns that such patterns may be reproduced or amplified. We systematically examined whether contemporary LLMs exhibit sex-specific biases in clinical reasoning and how model configuration influences these behaviours. We conducted three experiments using 50 clinician-authored vignettes spanning 44 specialties in which sex was non-informative to the initial diagnostic pathway. We evaluated four general-purpose LLMs: ChatGPT (gpt-4o-mini), Claude 3.7 Sonnet, Gemini 2.0 Flash, and DeepSeek-chat. All models demonstrated significant sex-assignment skew, with the predicted sex differing by model. At temperature 0.5, ChatGPT assigned female sex in 70% of cases (95% CI 0.66-0.75), DeepSeek in 61% (0.57-0.65) and Claude in 59% (0.55-0.63), whereas Gemini showed a male skew, assigning female sex in only 36% of cases (0.32-0.41). Contemporary LLMs exhibit stable, model-specific sex biases in clinical reasoning. Permitting abstention reduces explicit labelling but does not eliminate downstream diagnostic differences. Safe clinical integration requires conservative and documented configuration, specialty-level clinical data auditing, and continued human oversight when deploying general-purpose models in healthcare settings.
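The reported proportions and 95% confidence intervals can be reproduced with a standard binomial interval. The sketch below uses the Wilson score interval and an illustrative trial count of 400 (e.g. repeated runs over the 50 vignettes); both the interval method and the trial count are assumptions for illustration, not details taken from the study.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion.

    successes: number of runs assigning female sex
    n: total number of runs
    z: normal quantile (1.96 for a 95% interval)
    """
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# Hypothetical example: 280 female assignments out of 400 runs (70%),
# yielding an interval close to the 0.66-0.75 range reported for ChatGPT.
lo, hi = wilson_ci(280, 400)
print(f"70% female, 95% CI {lo:.2f}-{hi:.2f}")
```

With 280/400 this prints an interval of roughly 0.65 to 0.74; narrower or wider intervals would indicate more or fewer total runs behind each reported proportion.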