Various intrinsic mitigation strategies, along with many bias metrics, have been proposed to mitigate gender bias in contextualized language models. Because these language models are ultimately used for downstream tasks such as text classification, it is important to understand whether, and to what extent, intrinsic bias mitigation translates into fairness on those tasks. In this work, we design a probe to investigate the effects that some of the major intrinsic gender bias mitigation strategies have on downstream text classification tasks. We find that, rather than resolving gender bias, intrinsic mitigation techniques and metrics can hide it, so that significant gender information is retained in the embeddings. Furthermore, we show that each mitigation technique hides the bias from some of the intrinsic bias measures but not all, and that each intrinsic bias measure can be fooled by some mitigation techniques but not all. We confirm experimentally that none of the intrinsic mitigation techniques, when used without any other fairness intervention, consistently impacts extrinsic bias. We therefore recommend combining intrinsic bias mitigation techniques with other fairness interventions for downstream tasks.