Pre-trained language models (PLMs) have revolutionized both natural language processing research and applications. However, stereotypical biases (e.g., gender and racial discrimination) encoded in PLMs raise serious ethical concerns, which critically limit their broader application. To address these unfairness issues, we present fairBERTs, a general framework for learning fair fine-tuned BERT-series models by erasing protected sensitive information via semantic and fairness-aware perturbations generated by a generative adversarial network. Through extensive qualitative and quantitative experiments on two real-world tasks, we demonstrate the superiority of fairBERTs in mitigating unfairness while maintaining model utility. We also verify the feasibility of transferring the adversarial component of fairBERTs to other conventionally trained BERT-like models to yield fairness improvements. Our findings may shed light on further research on building fairer fine-tuned PLMs.
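To make the high-level idea concrete, the following is a minimal sketch, not the authors' released implementation, of an adversarial debiasing loop in the spirit described above: a generator adds a perturbation to a BERT sentence representation so that an adversary can no longer recover the protected attribute, while a task classifier keeps the downstream signal. All module names, hyper-parameters, and the stand-in random inputs are illustrative assumptions.

```python
# Illustrative sketch only: in practice, cls_repr would come from a
# fine-tuned BERT encoder's [CLS] output rather than random tensors.
import torch
import torch.nn as nn

HIDDEN = 768          # size of a BERT [CLS] representation
NUM_LABELS = 2        # downstream task labels
NUM_PROTECTED = 2     # e.g., a binary protected attribute such as gender

generator = nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.Tanh(),
                          nn.Linear(HIDDEN, HIDDEN))          # produces the perturbation
adversary = nn.Sequential(nn.Linear(HIDDEN, NUM_PROTECTED))   # tries to read the protected attribute
classifier = nn.Sequential(nn.Linear(HIDDEN, NUM_LABELS))     # downstream task head

opt_g = torch.optim.Adam(list(generator.parameters()) + list(classifier.parameters()), lr=2e-5)
opt_d = torch.optim.Adam(adversary.parameters(), lr=2e-5)
ce = nn.CrossEntropyLoss()

def train_step(cls_repr, task_labels, protected_labels, adv_weight=1.0):
    """One adversarial update on a batch of sentence representations."""
    # 1) Update the adversary on the perturbed representations (generator frozen).
    with torch.no_grad():
        perturbed = cls_repr + generator(cls_repr)
    opt_d.zero_grad()
    d_loss = ce(adversary(perturbed), protected_labels)
    d_loss.backward()
    opt_d.step()

    # 2) Update generator + classifier: keep task accuracy, fool the adversary.
    opt_g.zero_grad()
    perturbed = cls_repr + generator(cls_repr)
    task_loss = ce(classifier(perturbed), task_labels)
    fairness_loss = -ce(adversary(perturbed), protected_labels)   # maximize adversary error
    (task_loss + adv_weight * fairness_loss).backward()
    opt_g.step()
    return task_loss.item(), d_loss.item()

# Stand-in batch of 8 "sentence representations".
cls_repr = torch.randn(8, HIDDEN)
task_labels = torch.randint(0, NUM_LABELS, (8,))
protected_labels = torch.randint(0, NUM_PROTECTED, (8,))
print(train_step(cls_repr, task_labels, protected_labels))
```

The key design choice this sketch illustrates is the two-player objective: the adversary is trained to predict the protected attribute from the perturbed representation, while the generator is trained both to preserve task accuracy and to drive the adversary's loss up, so the perturbation erases protected information rather than arbitrary content.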