Large language models (LLMs) are widely deployed for open-ended communication, yet most bias evaluations still rely on English, classification-style tasks. We introduce DebateBias-8K, a new multilingual, debate-style benchmark designed to reveal how narrative bias emerges in realistic generative settings. Our dataset comprises 8,400 structured debate prompts spanning four sensitive domains (women's rights, socioeconomic development, terrorism, and religion) across seven languages, ranging from high-resource (English, Chinese) to low-resource (Swahili, Nigerian Pidgin). Using four flagship models (GPT-4o, Claude 3, DeepSeek, and LLaMA 3), we generate and automatically classify over 100,000 responses. Results show that all models reproduce entrenched stereotypes despite safety alignment: Arabs are overwhelmingly linked to terrorism and religion (≥95% of responses), Africans to socioeconomic "backwardness" (up to 77%), and Western groups are consistently framed as modern or progressive. Biases grow sharply in lower-resource languages, revealing that alignment trained primarily on English data does not generalize globally. Our findings highlight a persistent divide in multilingual fairness: current alignment methods reduce explicit toxicity but fail to prevent biased narratives in open-ended generation. We release the DebateBias-8K benchmark and our analysis framework to support the next generation of multilingual bias evaluation and safer, culturally inclusive model alignment.