The automatic assessment of translation quality has recently become crucial across several stages of the translation pipeline, from data curation to training and decoding. Although quality estimation (QE) metrics have been optimized to align with human judgments, no attention has been given to these metrics' potential biases, particularly in reinforcing visibility and usability for some demographic groups over others. This study is the first to investigate gender bias in QE metrics and its downstream impact on machine translation (MT). Focusing on out-of-English translations into languages with grammatical gender, we ask: Do contemporary QE metrics exhibit gender bias? Can the use of contextual information mitigate this bias? How does QE influence gender bias in MT outputs? Experiments with state-of-the-art QE metrics across multiple domains, datasets, and languages reveal significant bias. Masculine-inflected translations score higher than feminine-inflected ones, and gender-neutral translations are penalized. Moreover, context-aware QE metrics reduce errors for masculine-inflected references but fail to address feminine referents, exacerbating gender disparities. Additionally, QE metrics can perpetuate gender bias in MT systems when used in quality-aware decoding. Our findings underscore the need to address gender bias in QE metrics to ensure equitable and unbiased MT systems.