Generative large language models (LLMs) have been shown to exhibit harmful biases and stereotypes. While safety fine-tuning typically takes place in English, if at all, these models are being used by speakers of many different languages. There is existing evidence that the performance of these models is inconsistent across languages and that they discriminate based on demographic factors of the user. Motivated by this, we investigate whether the social stereotypes exhibited by LLMs differ as a function of the language used to prompt them, while controlling for cultural differences and task accuracy. To this end, we present MBBQ (Multilingual Bias Benchmark for Question-answering), a carefully curated version of the English BBQ dataset extended to Dutch, Spanish, and Turkish, which measures stereotypes commonly held across these languages. We further complement MBBQ with a parallel control dataset to measure task performance on the question-answering task independently of bias. Our results based on several open-source and proprietary LLMs confirm that some non-English languages suffer from bias more than English, even when controlling for cultural shifts. Moreover, we observe significant cross-lingual differences in bias behaviour for all except the most accurate models. With the release of MBBQ, we hope to encourage further research on bias in multilingual settings. The dataset and code are available at https://github.com/Veranep/MBBQ.