Ensuring fairness in NLP models is crucial, as they often encode sensitive attributes like gender and ethnicity, leading to biased outcomes. Current concept erasure methods attempt to mitigate this by modifying final latent representations to remove sensitive information without retraining the entire model. However, these methods typically rely on linear classifiers, which leave models vulnerable to non-linear adversaries capable of recovering sensitive information. We introduce Targeted Concept Erasure (TaCo), a novel approach that removes sensitive information from final latent representations, ensuring fairness even against non-linear classifiers. Our experiments show that TaCo outperforms state-of-the-art methods, achieving greater reductions in the prediction accuracy of sensitive attributes by non-linear classifiers while preserving overall task performance. Code is available at https://github.com/fanny-jourdan/TaCo.
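The evaluation described above can be illustrated with a minimal sketch (this is not the TaCo algorithm): train a non-linear probe to predict a sensitive attribute from latent representations, apply a simple erasure step, and check how far probe accuracy drops. Here "erasure" is a basic mean-difference projection on synthetic data; all names and data are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 16
z = rng.normal(size=(n, d))          # synthetic latent representations
s = rng.integers(0, 2, size=n)       # synthetic binary sensitive attribute
z[s == 1] += 0.8                     # leak the attribute into the representations

def probe_accuracy(reps, labels):
    """Test accuracy of a non-linear (MLP) probe predicting the attribute."""
    Xtr, Xte, ytr, yte = train_test_split(reps, labels, random_state=0)
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    clf.fit(Xtr, ytr)
    return clf.score(Xte, yte)

# Simple erasure baseline: project out the mean-difference direction
# between the two sensitive-attribute groups.
v = z[s == 1].mean(axis=0) - z[s == 0].mean(axis=0)
v /= np.linalg.norm(v)
z_erased = z - np.outer(z @ v, v)

acc_before = probe_accuracy(z, s)
acc_after = probe_accuracy(z_erased, s)
print(f"probe accuracy before erasure: {acc_before:.2f}")
print(f"probe accuracy after erasure:  {acc_after:.2f}")
```

A successful erasure method drives the post-erasure probe accuracy toward chance (0.5 here) while leaving the main task's representations usable; the abstract's claim is that TaCo achieves this even against such non-linear probes.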