Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency

From arXiv: 7 pages, 3 tables. Multi-model replication across Gemini, Claude, and GPT. Code and data: https://github.com/wongqihan/ai-behavioral-experiments

We investigate whether large language models produce different medical triage recommendations for identical neurological symptoms when only the patient's stated gender and age vary. Using three model families (Gemini 3.5 Flash, Claude Sonnet 4.6, and GPT-5.4-mini), we present a standardized symptom profile (persistent headache, blurred vision, morning nausea, visual disturbances) across seven demographic conditions: three age groups (25, 38, 65) × two genders (male, female), plus a gender-unspecified baseline (n = 30 per condition per model, 630 total trials). We find a stark, systemic gender-dependent triage disparity: young women receive significantly lower emergency room (ER) referral rates than age-matched men (Gemini: 0% vs. 23.3%; Claude: 6.7% vs. 96.7%; GPT: 6.7% vs. 66.7%; all p < 0.001). The disparity disappears at age 65 for all models. The primary mechanism is diagnostic substitution: the models anchor on a gender-associated diagnosis, preferentially diagnosing young women with Idiopathic Intracranial Hypertension (IIH), a condition epidemiologically linked to women of childbearing age, while diagnosing men with generic increased intracranial pressure and keeping space-occupying lesions in the differential. This diagnostic closure routes female patients to lower-urgency care (outpatient doctor appointments) despite comparable severity ratings (7-9/10). Our findings demonstrate that clinical LLMs replicate documented human clinical biases by using epidemiological priors to suppress triage urgency, suggesting that AI triage engines must decouple urgency assessment from probabilistic diagnostic priors. We release all code, prompts, and raw results.
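The trial count follows directly from the factorial design described in the abstract. As a minimal sketch (not the authors' released code), the condition grid can be enumerated as below. The names `MODELS`, `build_conditions`, and `total_trials` are hypothetical, and the baseline's age is left as `None` because the abstract does not state it.

```python
from itertools import product

# Values taken from the abstract; the variable names are illustrative only.
MODELS = ["Gemini 3.5 Flash", "Claude Sonnet 4.6", "GPT-5.4-mini"]
AGES = [25, 38, 65]
GENDERS = ["male", "female"]
TRIALS_PER_CONDITION = 30  # n = 30 per condition per model


def build_conditions():
    """Return the seven demographic conditions: 3 ages x 2 genders + 1 baseline."""
    conditions = [
        {"age": age, "gender": gender} for age, gender in product(AGES, GENDERS)
    ]
    # Gender-unspecified baseline. Assumption: the abstract does not say what
    # age (if any) the baseline prompt states, so it is left unset here.
    conditions.append({"age": None, "gender": None})
    return conditions


def total_trials():
    """Total trials = models x conditions x repetitions per condition."""
    return len(MODELS) * len(build_conditions()) * TRIALS_PER_CONDITION


if __name__ == "__main__":
    print(len(build_conditions()), "conditions,", total_trials(), "trials")
```

With three models, seven conditions, and 30 repetitions each, the sketch yields 3 × 7 × 30 = 630, which matches the total reported above.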
