Gender bias in language models has attracted considerable attention because it threatens social justice. However, most current debiasing methods degrade the model's performance on other tasks, and the mechanism behind this degradation remains poorly understood. We propose a theoretical framework that explains three candidate mechanisms of gender bias in language models. We use this framework to explain why current debiasing methods cause performance degradation, and we identify a pathway through which debiasing does not degrade model performance. We further develop a causality-detection fine-tuning approach to correct gender bias. Numerical experiments demonstrate that our method yields a double dividend: it partially mitigates gender bias while avoiding performance degradation.