Interpretability methods can be applied to understand the decisions of (black-box) models, such as neural machine translation (NMT) systems or large language models (LLMs). Yet research in this area has paid little attention to a well-documented problem in these models: gender bias. In this work, we aim to move beyond simply measuring bias to exploring its origins. Working with gender-ambiguous natural source data, this exploratory study examines which context, in the form of input tokens in the source sentence (EN), influences (or triggers) the NMT model's choice of a certain gender inflection in the target languages (DE/ES). To analyse this, we compute saliency attribution based on contrastive translations. We first address the lack of an established scoring threshold and examine the influence of source words on the model's gender decisions at different attribution levels. We compare salient source words with human perceptions of gender and demonstrate a noticeable overlap between the two. Additionally, we provide a linguistic analysis of salient words. Our work showcases the relevance of understanding a model's translation decisions in terms of gender, shows how these compare to human decisions, and argues that this information should be leveraged to mitigate gender bias.
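To make the attribution step concrete, the following is a minimal sketch of contrastive saliency attribution in the spirit described above. The MarianMT checkpoint, the gender-ambiguous example sentence, the contrastive pair "Der"/"Die", and plain gradient-norm saliency are all illustrative assumptions, not the paper's exact setup.

```python
# Sketch: gradient-based contrastive saliency attribution for a gender
# inflection in NMT. All concrete choices (checkpoint, example, token pair)
# are assumptions for illustration.
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"  # assumed EN->DE checkpoint
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name).eval()

src = "I work as a nurse at the local hospital."  # gender-ambiguous source
enc = tokenizer(src, return_tensors="pt")

# Contrastive target tokens: first subword of a masculine vs. feminine
# translation prefix, e.g. "Der" vs. "Die" (each assumed to be one subword).
masc_id = tokenizer(text_target="Der").input_ids[0]
fem_id = tokenizer(text_target="Die").input_ids[0]

# Embed the source ourselves so we can take gradients w.r.t. the embeddings
# (Marian applies embed_scale only when it computes embeddings internally).
encoder = model.get_encoder()
embeds = encoder.embed_tokens(enc.input_ids) * encoder.embed_scale
embeds = embeds.detach().requires_grad_(True)

# One decoder step from the start token: predict the first target token.
dec_in = torch.tensor([[model.config.decoder_start_token_id]])
out = model(inputs_embeds=embeds, attention_mask=enc.attention_mask,
            decoder_input_ids=dec_in)
log_probs = out.logits[0, -1].log_softmax(-1)

# Contrastive score: how much the model prefers the masculine inflection
# over the feminine one. Its gradient attributes that choice to the source.
(log_probs[masc_id] - log_probs[fem_id]).backward()
saliency = embeds.grad[0].norm(dim=-1)  # one score per source subword

for tok, score in zip(tokenizer.convert_ids_to_tokens(enc.input_ids[0].tolist()),
                      saliency.tolist()):
    print(f"{tok:>12s}  {score:.4f}")
```

Other attribution targets (e.g. the log-probability difference summed over a full contrastive translation pair) would slot into the same scaffold by changing only the scored quantity.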
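The comparison with human perceptions can likewise be sketched as thresholding normalised saliency scores at several attribution levels and measuring overlap with human-marked gender cues. The threshold values and the toy annotations below are invented for illustration.

```python
# Hypothetical sketch: overlap between model-salient source words and
# human-marked gender cues at several attribution levels. Toy data only.
tokens = ["I", "work", "as", "a", "nurse", "at", "the", "hospital"]
saliency = [0.05, 0.10, 0.02, 0.03, 0.90, 0.04, 0.02, 0.20]  # toy scores
human_cues = {"nurse"}  # words humans marked as gender triggers

# Min-max normalise so thresholds are comparable across sentences.
lo, hi = min(saliency), max(saliency)
norm = [(s - lo) / (hi - lo) for s in saliency]

for threshold in (0.25, 0.50, 0.75):  # assumed attribution levels
    salient = {t for t, s in zip(tokens, norm) if s >= threshold}
    inter = salient & human_cues
    union = salient | human_cues
    print(f"t={threshold:.2f}  salient={sorted(salient)}  "
          f"IoU={len(inter) / len(union):.2f}")
```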