We investigate whether large language models produce different medical triage recommendations for identical symptoms based solely on the language of the patient prompt. Using Gemini 3.5 Flash, we evaluate a neurological symptom profile (persistent headache, blurred vision, nausea) across six languages (English, Spanish, Chinese, Hindi, Japanese, Arabic) with 30 runs per condition (450 API calls in total). The model recommends an emergency room (ER) visit at rates ranging from 0% (Japanese, Hindi) to 30% (English, Arabic), despite assigning nearly identical severity scores (7.7-8.0/10) across all languages. Adding a single sentence specifying that the patient is located in the US increases ER recommendations by up to 76.7 percentage points for non-English prompts, while the reverse anchor (an English prompt with a Tokyo location) reduces the ER rate from 30% to 6.7%. A back-translation control (Japanese to English) produces ER rates comparable to the English baseline, indicating that the disparity stems from implicit geographic inference from the input language rather than from translation quality. We release the complete dataset, experiment code, and results.
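The reported rates can be related to the 30 runs per condition with simple arithmetic. The sketch below is illustrative only and is not the released experiment code: the helper names `er_rate` and `pp_diff` and the label strings are hypothetical, and the counts (9 of 30 and 2 of 30) are inferred from the stated percentages rather than taken from the dataset.

```python
RUNS_PER_CONDITION = 30  # stated in the abstract


def er_rate(recommendations):
    """Fraction of runs whose recommendation is an ER visit."""
    return sum(r == "ER" for r in recommendations) / len(recommendations)


def pp_diff(rate_a, rate_b):
    """Difference between two rates, in percentage points."""
    return round((rate_a - rate_b) * 100, 1)


# Hypothetical run outcomes consistent with the reported rates (assumed counts).
english_baseline = ["ER"] * 9 + ["other"] * 21  # 9/30 = 30%
english_tokyo = ["ER"] * 2 + ["other"] * 28     # 2/30 = 6.7%

assert len(english_baseline) == RUNS_PER_CONDITION
assert len(english_tokyo) == RUNS_PER_CONDITION

baseline = er_rate(english_baseline)
anchored = er_rate(english_tokyo)

print(f"English baseline ER rate: {baseline * 100:.1f}%")
print(f"English + Tokyo ER rate:  {anchored * 100:.1f}%")
print(f"Reduction: {pp_diff(baseline, anchored)} percentage points")
```

Under these assumed counts, the reverse anchor corresponds to a drop of 7 ER recommendations out of 30 runs.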