Large language models (LLMs) and machine translation (MT) systems are increasingly used in daily life, but their outputs can reproduce gender bias present in their training data. Most resources for evaluating such bias are designed for English and reflect its sociocultural context, which limits their applicability to other languages. This work addresses that gap by introducing two new datasets for evaluating gender bias in translations involving Basque, a low-resource and genderless language. WinoMTeus adapts the WinoMT benchmark to examine how gender-neutral Basque occupations are translated into gendered languages such as Spanish and French. FLORES+Gender, in turn, extends the FLORES+ benchmark to assess whether translation quality from gendered languages (Spanish and English) into Basque varies with the gender of the referent. We evaluate several general-purpose LLMs as well as open and proprietary MT systems. The results reveal a systematic preference for masculine forms and, in some models, slightly higher translation quality for masculine referents. Overall, these findings show that gender bias remains deeply rooted in these models and highlight the need for evaluation methods that consider both linguistic features and cultural context.