Natural language processing models tend to learn and encode social biases present in the data. One popular approach for addressing such biases is to eliminate encoded information from the model's representations. However, current methods are restricted to removing only linearly encoded information. In this work, we propose Iterative Gradient-Based Projection (IGBP), a novel method for removing non-linearly encoded concepts from neural representations. Our method iteratively trains neural classifiers to predict the attribute we seek to eliminate, then projects the representation onto a hypersurface such that the classifiers become oblivious to the target attribute. We evaluate the method on removing gender and race information, treated as sensitive attributes. Intrinsic and extrinsic evaluations show that IGBP mitigates bias effectively, with minimal impact on downstream task accuracy.
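The train-then-project loop described above can be illustrated with a minimal NumPy sketch. The abstract does not give the projection formula, so this sketch assumes a first-order projection onto the classifier's decision boundary (the hypersurface where the logit is zero), a small one-hidden-layer tanh classifier, and plain gradient descent. The function names (`train_mlp`, `logit_and_grad`, `project`, `igbp_sketch`) and all hyperparameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def train_mlp(X, y, hidden=16, steps=300, lr=0.5, seed=0):
    """Fit a one-hidden-layer tanh classifier to binary labels y (0/1).

    Assumption: a tiny MLP trained with full-batch gradient descent on
    mean binary cross-entropy stands in for the "neural classifier".
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d), (d, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), hidden)
    b2 = 0.0
    for _ in range(steps):
        H = np.tanh(X @ W1 + b1)
        logit = H @ w2 + b2
        p = 0.5 * (1.0 + np.tanh(0.5 * logit))   # numerically stable sigmoid
        g = (p - y) / n                          # d(loss)/d(logit)
        gH = np.outer(g, w2) * (1.0 - H ** 2)    # backprop through tanh
        w2 = w2 - lr * (H.T @ g)
        b2 = b2 - lr * g.sum()
        W1 = W1 - lr * (X.T @ gH)
        b1 = b1 - lr * gH.sum(axis=0)
    return W1, b1, w2, b2

def logit_and_grad(params, X):
    """Return the classifier logit f(x) and its input gradient for each row."""
    W1, b1, w2, b2 = params
    H = np.tanh(X @ W1 + b1)
    logit = H @ w2 + b2
    grad = ((1.0 - H ** 2) * w2) @ W1.T          # shape (n, d)
    return logit, grad

def project(params, X, eps=1e-8):
    """Assumed projection step: x <- x - f(x) * grad f(x) / ||grad f(x)||^2.

    This moves each point toward the surface f(x) = 0 to first order,
    where the classifier has no preference between the two classes.
    """
    logit, grad = logit_and_grad(params, X)
    scale = logit / (np.sum(grad ** 2, axis=1) + eps)
    return X - scale[:, None] * grad

def igbp_sketch(X, z, rounds=5, **train_kwargs):
    """Alternate between training a fresh classifier for z and projecting."""
    X = np.array(X, dtype=float)                 # copy; leave the input intact
    for r in range(rounds):
        params = train_mlp(X, z, seed=r, **train_kwargs)
        X = project(params, X)
    return X
```

Calling `igbp_sketch(X, z)` with an `(n, d)` array of representations `X` and a 0/1 attribute vector `z` returns an array of the same shape. A fresh classifier is trained each round because information the previous classifier missed may still be recoverable after the projection. A practical implementation would also need a stopping criterion, such as classifier accuracy approaching chance, which this sketch omits.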