Generative large language models (LLMs) have been shown to exhibit harmful biases and stereotypes. While safety fine-tuning, if it happens at all, typically takes place in English, these models are used by speakers of many different languages. Existing evidence shows that the performance of these models is inconsistent across languages and that they discriminate based on demographic factors of the user. Motivated by this, we investigate whether the social stereotypes exhibited by LLMs differ as a function of the language used to prompt them, while controlling for cultural differences and task accuracy. To this end, we present MBBQ (Multilingual Bias Benchmark for Question-answering), a carefully curated version of the English BBQ dataset extended to Dutch, Spanish, and Turkish, which measures stereotypes commonly held across these languages. We further complement MBBQ with a parallel control dataset to measure task performance on the question-answering task independently of bias. Our results based on several open-source and proprietary LLMs confirm that some non-English languages suffer from bias more than English, even when controlling for cultural shifts. Moreover, we observe significant cross-lingual differences in bias behaviour for all except the most accurate models. With the release of MBBQ, we hope to encourage further research on bias in multilingual settings. The dataset and code are available at https://github.com/Veranep/MBBQ.