Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs

Large language models (LLMs) are widely deployed for open-ended communication, yet most bias evaluations still rely on English-language, classification-style tasks. We introduce \corpusname, a multilingual, debate-style benchmark designed to reveal how narrative bias appears in realistic generative settings. The dataset comprises 8{,}400 structured debate prompts spanning four sensitive domains -- Women's Rights, Backwardness, Terrorism, and Religion -- across seven languages ranging from high-resource (English, Chinese) to low-resource (Swahili, Nigerian Pidgin). Using four flagship models (GPT-4o, Claude~3.5~Haiku, DeepSeek-Chat, and LLaMA-3-70B), we generate over 100{,}000 debate responses and automatically classify which demographic groups are assigned stereotyped versus modern roles. All four models reproduce entrenched stereotypes despite safety alignment: Arabs are overwhelmingly linked to Terrorism and Religion ($\geq$89\%), Africans to socioeconomic ``backwardness'' (up to 77\%), and Western groups are consistently framed as modern or progressive. Biases grow sharply in lower-resource languages, indicating that alignment trained primarily in English does not generalize across languages. These findings highlight a persistent gap in multilingual fairness: current alignment methods reduce explicit toxicity but fail to prevent biased outputs in open-ended contexts. We release the \corpusname{} benchmark and analysis framework to support multilingual bias evaluation and safer, culturally inclusive model alignment.
