The fairness of Natural Language Processing (NLP) models has emerged as a crucial concern. Information theory indicates that to achieve fairness, a model should not be able to predict sensitive variables such as gender, ethnicity, and age. However, information related to these variables often appears implicitly in language, making it challenging to identify and mitigate biases effectively. To tackle this issue, we present a novel approach that operates at the embedding level of an NLP model, independent of the specific architecture. Our method leverages insights from recent advances in XAI techniques and employs an embedding transformation to eliminate implicit information about a selected variable. By directly manipulating the embeddings in the final layer, our approach enables seamless integration into existing models without requiring significant modifications or retraining. In our evaluation, we show that the proposed post-hoc approach significantly reduces gender-related associations in NLP models while preserving the overall performance and functionality of the models. An implementation of our method is available at: https://github.com/fanny-jourdan/TaCo
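To illustrate the general idea of a post-hoc embedding transformation that removes information about a sensitive variable, here is a minimal sketch using a linear null-space projection on toy final-layer embeddings. This is an assumption-laden illustration of the concept-removal family of techniques, not the paper's actual TaCo method; all data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "final-layer" embeddings (n samples, d dims) with a planted
# sensitive-variable signal in dimension 0 (hypothetical data).
n, d = 200, 16
gender = rng.integers(0, 2, size=n)            # sensitive labels (0/1)
E = rng.normal(size=(n, d))
E[:, 0] += 3.0 * (2 * gender - 1)              # implicit gender information

# Fit a linear probe (least squares) to find the gender direction.
y = 2.0 * gender - 1.0
w, *_ = np.linalg.lstsq(E, y, rcond=None)
w = w / np.linalg.norm(w)

# Post-hoc transformation: project embeddings onto the null space of
# the probe direction, removing the linearly decodable gender signal.
P = np.eye(d) - np.outer(w, w)
E_clean = E @ P

# A fresh probe on the cleaned embeddings should be near chance level.
w2, *_ = np.linalg.lstsq(E_clean, y, rcond=None)
acc_before = ((E @ w > 0).astype(int) == gender).mean()
acc_after = ((E_clean @ w2 > 0).astype(int) == gender).mean()
```

Because the transformation is applied directly to the embeddings of the last layer, it can be bolted onto a frozen model without retraining, which is the integration property the abstract highlights; the actual method additionally aims to preserve task-relevant information when removing the sensitive component.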