Societal biases are reflected in large pre-trained language models and in their versions fine-tuned on downstream tasks. Common in-processing bias mitigation approaches, such as adversarial training and mutual information removal, introduce additional optimization criteria and update the model to reach a new debiased state. In practice, however, end-users and practitioners might prefer to switch back to the original model, or to apply debiasing only to a specific subset of protected attributes. To enable this, we propose a novel modular bias mitigation approach consisting of stand-alone, highly sparse debiasing subnetworks, where each debiasing module can be integrated into the core model on demand at inference time. Our approach draws on the concept of \emph{diff} pruning and proposes a novel training regime adaptable to various representation disentanglement optimizations. We conduct experiments on three classification tasks with gender, race, and age as protected attributes. The results show that our modular approach, while maintaining task performance, improves (or at least remains on par with) the effectiveness of bias mitigation in comparison with baseline fine-tuning. In particular, on a two-attribute dataset, our approach with separately learned debiasing subnetworks effectively uses either or both of the subnetworks for selective bias mitigation.
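To make the on-demand integration concrete, the following is a minimal sketch of how sparse debiasing modules might be composed with a core model at inference time. It assumes, in the spirit of diff pruning, that each module is a sparse additive difference over the core parameters and that several modules are combined by summation; the abstract does not specify either detail. The names `apply_modules`, `gender_diff`, and `age_diff` and all numeric values are hypothetical, and the training regime itself is not shown.

```python
from typing import Dict, List

# Parameters are modelled as flat lists of floats keyed by tensor name.
Params = Dict[str, List[float]]
# A sparse debiasing module: tensor name -> {flat index: additive delta}.
SparseDiff = Dict[str, Dict[int, float]]


def apply_modules(base: Params, modules: List[SparseDiff]) -> Params:
    """Return a copy of `base` with every sparse diff added to it.

    `base` is never modified, so the original model stays available.
    Summing several diffs is an assumption made for this sketch.
    """
    merged = {name: list(values) for name, values in base.items()}
    for diff in modules:
        for name, entries in diff.items():
            for idx, delta in entries.items():
                merged[name][idx] += delta
    return merged


# Toy core model and two hypothetical, separately learned modules.
base = {"layer.weight": [1.0, 2.0, 3.0, 4.0], "layer.bias": [0.0, 0.0]}
gender_diff = {"layer.weight": {1: -0.5}}
age_diff = {"layer.weight": {3: 0.25}, "layer.bias": {0: 1.0}}

original = apply_modules(base, [])                    # no debiasing
gender_only = apply_modules(base, [gender_diff])      # one attribute
both = apply_modules(base, [gender_diff, age_diff])   # both attributes
```

Because the core parameters are copied rather than overwritten, switching back to the original model amounts to passing an empty module list, and selective mitigation amounts to choosing which diffs to include.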