To mitigate the societal biases implicitly encoded in recent successful pretrained language models, a diverse array of approaches has been proposed to encourage model fairness, focusing on prompting, data augmentation, regularized fine-tuning, and more. Despite these developments, it remains nontrivial to reach a principled understanding of fairness or an effective algorithm that can consistently debias language models. In this work, through rigorous evaluations of Neural Collapse -- a learning phenomenon that occurs in the last-layer representations and classifiers of deep networks -- on fairness-related words, we find that debiased language models exhibit collapsed alignment between token representations and word embeddings. More importantly, this observation inspires us to design a principled fine-tuning method that can effectively improve fairness across a wide range of debiasing methods, while preserving the performance of language models on standard natural language understanding tasks. Our code is available at https://anonymous.4open.science/r/Fairness_NC-457E .
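To make the measured quantity concrete, below is a minimal sketch, not the authors' released code, of how alignment between last-layer token representations and output word embeddings could be probed for fairness-related words. The model choice (bert-base-uncased), the neutral template sentence, the gendered word list, and the cosine-similarity metric are all illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch: probe alignment between last-layer token representations
# and output word-embedding rows for fairness-related words.
# All concrete choices below (model, template, word list) are assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-uncased"  # assumption: any masked LM with an output embedding matrix
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# Hypothetical fairness-related word pairs (gendered terms).
word_pairs = [("he", "she"), ("man", "woman"), ("king", "queen")]

def last_layer_rep(word: str) -> torch.Tensor:
    """Mean last-hidden-state vector of `word` inside a neutral template."""
    text = f"The word {word} appears here."  # assumed neutral context
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[-1][0]  # (seq_len, dim)
    # Locate the subword positions belonging to `word`.
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    ids = inputs["input_ids"][0].tolist()
    for start in range(len(ids) - len(word_ids) + 1):
        if ids[start:start + len(word_ids)] == word_ids:
            return hidden[start:start + len(word_ids)].mean(dim=0)
    raise ValueError(f"{word!r} not found in tokenized template")

# Output word-embedding rows play the role of the classifier
# in the Neural Collapse analogy.
emb = model.get_output_embeddings().weight  # (vocab_size, dim)

for pair in word_pairs:
    for w in pair:
        h = last_layer_rep(w)
        e = emb[tokenizer(w, add_special_tokens=False)["input_ids"][0]]
        cos = torch.nn.functional.cosine_similarity(h, e, dim=0).item()
        print(f"{w:>6s}: representation-embedding alignment = {cos:.3f}")
```

Under the paper's finding, one would expect this alignment to be noticeably higher (more "collapsed") for a debiased model than for its biased counterpart on such word pairs; the sketch only illustrates the measurement, not the fine-tuning method itself.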