Five years after the discovery of persistent anti-Muslim bias in large language models, most evaluations remain confined to single-turn prompt completion, a setting that no longer reflects how frontier LLMs are deployed. We introduce \textbf{MIRAGE} (Muslim-Identity Reasoning and Agentic Generation Evaluation), a benchmark of 1{,}200 prompts spanning three deployment-realistic conditions: direct completion, chain-of-thought reasoning, and simulated agentic decision-making across content moderation, lending triage, refugee claim summarization, and hiring screens. Across six frontier models, we find that (i) chain-of-thought reasoning \emph{amplifies} rather than suppresses Muslim-violence associations, by 12--34\% relative to direct completion, (ii) agentic decisions exhibit a 9--22 percentage-point asymmetry between Muslim and matched non-Muslim cases on identical evidence, and (iii) bias is tightly coupled to the recency of retrieved news context, increasing by 18--27\% under recent-conflict retrieval. Existing prompt-based mitigations transfer poorly across our three conditions: they suppress direct-completion bias while leaving agentic asymmetry largely intact. We release MIRAGE and an open evaluation harness to support targeted mitigation research.
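
The abstract reports findings in two different units, relative percent and percentage points, so a short worked reading may help. As a sketch under assumptions not stated in the abstract, let $r_c$ denote the rate of violence-associated outputs under condition $c$, and let $p_{\mathrm{M}}$ and $p_{\mathrm{N}}$ denote the rates of adverse agentic decisions on Muslim and matched non-Muslim cases. These symbols and normalizations are illustrative only; the benchmark's own definitions may differ.
\[
\Delta_{\mathrm{rel}} = \frac{r_{\mathrm{CoT}} - r_{\mathrm{direct}}}{r_{\mathrm{direct}}} \times 100\%,
\qquad
\Delta_{\mathrm{pp}} = 100\,\bigl(p_{\mathrm{M}} - p_{\mathrm{N}}\bigr).
\]
Under this reading, a hypothetical direct-completion rate of $0.50$ rising to $0.60$ under chain-of-thought gives $\Delta_{\mathrm{rel}} = 20\%$, whereas hypothetical adverse-decision rates of $0.40$ and $0.25$ give $\Delta_{\mathrm{pp}} = 15$ percentage points.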