AI regulations are expected to prohibit machine learning models from using sensitive attributes during training. However, the latest Natural Language Processing (NLP) classifiers, which rely on deep learning, operate as black-box systems, complicating the detection and remediation of such misuse. Traditional bias mitigation methods in NLP aim for comparable performance across groups defined by attributes such as gender or race, but they fail to address the underlying issue: the model's reliance on protected attributes. To address this shortcoming, we introduce NLPGuard, a framework for mitigating the reliance on protected attributes in NLP classifiers. NLPGuard takes an unlabeled dataset, an existing NLP classifier, and its training data as input, and produces a modified training dataset that significantly reduces dependence on protected attributes without compromising accuracy. We apply NLPGuard to three classification tasks: toxic language detection, sentiment analysis, and occupation classification. Our evaluation shows that current NLP classifiers depend heavily on protected attributes, with up to $23\%$ of the most predictive words associated with such attributes. NLPGuard reduces this reliance by up to $79\%$, while slightly improving accuracy.