Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric. We introduce EuroGEST, a dataset designed to measure gender-stereotypical reasoning in LLMs across English and 29 European languages. EuroGEST builds on an existing expert-informed benchmark covering 16 gender stereotypes, expanded in this work using translation tools, quality estimation metrics, and morphological heuristics. Human evaluations confirm that our data generation method results in high accuracy of both translations and gender labels across languages. We use EuroGEST to evaluate 24 multilingual language models from six model families, demonstrating that the strongest stereotypes in all models across all languages are that women are 'beautiful', 'empathetic' and 'neat' and men are 'leaders', 'strong, tough' and 'professional'. We also show that larger models encode gendered stereotypes more strongly and that instruction finetuning does not consistently reduce gendered stereotypes. Our work highlights the need for more multilingual studies of fairness in LLMs and offers scalable methods and resources to audit gender bias across languages.