Quality estimation (QE) -- the automatic assessment of translation quality -- has recently become crucial across several stages of the translation pipeline, from data curation to training and decoding. While QE metrics have been optimized to align with human judgments, whether they encode social biases has been largely overlooked. Biased QE risks favoring certain demographic groups over others, e.g., by exacerbating gaps in visibility and usability. This paper defines and investigates gender bias in QE metrics and discusses its downstream implications for machine translation (MT). Experiments with state-of-the-art QE metrics across multiple domains, datasets, and languages reveal significant bias. When a human entity's gender in the source is undisclosed, masculine-inflected translations score higher than feminine-inflected ones, and gender-neutral translations are penalized. Even when contextual cues disambiguate gender, context-aware QE metrics make more errors in selecting the correctly inflected translation for feminine referents than for masculine ones. Moreover, a biased QE metric affects data filtering and quality-aware decoding. Our findings highlight the need for renewed focus on developing and evaluating QE metrics with gender in mind.