Generative large language models (LLMs) have been shown to exhibit harmful biases and stereotypes. While safety fine-tuning typically takes place in English, if at all, these models are being used by speakers of many different languages. There is existing evidence that the performance of these models is inconsistent across languages and that they discriminate based on demographic factors of the user. Motivated by this, we investigate whether the social stereotypes exhibited by LLMs differ as a function of the language used to prompt them, while controlling for cultural differences and task accuracy. To this end, we present MBBQ (Multilingual Bias Benchmark for Question-answering), a carefully curated version of the English BBQ dataset extended to Dutch, Spanish, and Turkish, which measures stereotypes commonly held across these languages. We further complement MBBQ with a parallel control dataset to measure task performance on the question-answering task independently of bias. Our results based on several open-source and proprietary LLMs confirm that some non-English languages suffer from bias more than English, even when controlling for cultural shifts. Moreover, we observe significant cross-lingual differences in bias behaviour for all except the most accurate models. With the release of MBBQ, we hope to encourage further research on bias in multilingual settings. The dataset and code are available at https://github.com/Veranep/MBBQ.