Group fairness is a central research topic in text classification, where achieving fair treatment across sensitive groups (e.g. women vs. men) remains an open challenge. This paper presents a novel, architecture-agnostic method for mitigating bias in neural text classification. Given the difficulty of distinguishing fair from unfair information in a text encoder, we take inspiration from adversarial training to induce Wasserstein independence between the representations learned to predict the target label and those learned to predict a sensitive attribute. Our approach provides two significant advantages. First, it does not require annotations of sensitive attributes in either the testing or the training data. This makes it better suited to real-life scenarios than existing methods, which require sensitive-attribute annotations at training time. Second, it achieves a fairness-accuracy trade-off comparable to or better than that of existing methods.
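The abstract does not state the training objective, so the following is only a sketch of how "Wasserstein independence" is commonly formalized, assuming the standard Wasserstein dependency measure: the 1-Wasserstein distance between the joint distribution of the two representations and the product of their marginals, written in its Kantorovich-Rubinstein dual form. The symbols $Z$ (representation for the target label), $S$ (representation for the sensitive attribute), $f$ (a 1-Lipschitz critic), $\mathcal{L}_{\text{task}}$ and the weight $\beta$ are assumptions introduced here for illustration and do not come from the text.

```latex
% Assumed notation: Z = target-label representation, S = sensitive-attribute representation.
% Dependence measured as the W_1 distance between the joint and the product of marginals.
I_W(Z; S) \;=\; W_1\!\left(P_{Z,S},\; P_Z \otimes P_S\right)
\;=\; \sup_{\|f\|_{L} \le 1}\;
      \mathbb{E}_{(z,s) \sim P_{Z,S}}\!\left[f(z,s)\right]
      \;-\;
      \mathbb{E}_{z \sim P_Z,\; s \sim P_S}\!\left[f(z,s)\right]

% Hypothetical regularized objective: task loss plus a dependence penalty weighted by \beta \ge 0.
\min_{\theta}\; \mathcal{L}_{\text{task}}(\theta) \;+\; \beta\, I_W(Z; S)
```

Under this reading, $I_W(Z; S) = 0$ exactly when $Z$ and $S$ are independent. In practice the supremum would be approximated by a critic network trained adversarially against the encoder, which is consistent with the adversarial-training inspiration mentioned in the abstract.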