The vast majority of work on gender in machine translation (MT) focuses on 'unambiguous' inputs, where gender markers in the source language are expected to be resolved in the output. By contrast, this paper explores the widespread case where the source sentence lacks explicit gender markers but the target sentence contains them because the target language has richer grammatical gender. We focus in particular on inputs containing person names. Investigating such sentence pairs casts new light on research into MT gender bias and its mitigation. We find that many name-gender co-occurrences in MT data cannot be resolved with 'unambiguous gender' in the source language, and that gender-ambiguous examples can make up a large proportion of training examples. From this, we discuss potential steps toward gender-inclusive translation that accepts ambiguity in both gender and translation.