A key challenge in mitigating social bias in fine-tuned language models (LMs) is the potential degradation of language modeling capability, which harms downstream performance. Counterfactual data augmentation (CDA), a widely used fine-tuning method, illustrates this issue: it can generate synthetic data that aligns poorly with real-world distributions, or produce overly simplistic counterfactuals that ignore the social context of the altered sensitive attributes (e.g., gender) in the pretraining corpus. To address these limitations, we propose a simple yet effective context-augmented CDA method, Context-CDA, which uses large LMs to enhance the diversity and contextual relevance of the debiasing corpus. By minimizing discrepancies between the debiasing corpus and the pretraining data through augmented context, our approach ensures better alignment and preserves language modeling capability. We then apply uncertainty-based filtering to exclude generated counterfactuals that the target smaller LMs (i.e., the LMs to be debiased) judge to be low-quality, further improving the quality of the fine-tuning corpus. Experimental results on gender bias benchmarks demonstrate that Context-CDA effectively mitigates bias without sacrificing language modeling performance, while also offering insights into social biases through an analysis of distribution shifts in next-token generation probabilities.
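To make the two components concrete, the sketch below illustrates (1) the classic CDA step of swapping gendered terms to produce a counterfactual sentence, and (2) a generic uncertainty-based filter that keeps only counterfactuals the target LM scores below a threshold. This is a minimal illustration, not the paper's implementation: the word-pair list is a toy subset, and the `score` function stands in for whatever uncertainty measure (e.g., the target model's perplexity) is actually used.

```python
import re

# Toy subset of gendered word pairs used by classic CDA (illustrative only).
GENDER_PAIRS = {"he": "she", "she": "he", "his": "her", "her": "his",
                "man": "woman", "woman": "man"}

def counterfactual(sentence: str) -> str:
    """Swap each gendered word for its counterpart, preserving capitalization."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        repl = GENDER_PAIRS[word.lower()]
        return repl.capitalize() if word[0].isupper() else repl
    pattern = r"\b(" + "|".join(GENDER_PAIRS) + r")\b"
    return re.sub(pattern, swap, sentence, flags=re.IGNORECASE)

def filter_by_uncertainty(candidates, score, threshold):
    """Keep counterfactuals whose uncertainty score (e.g., target-LM
    perplexity) is below the threshold; `score` is a stand-in callable."""
    return [c for c in candidates if score(c) <= threshold]

print(counterfactual("He finished his shift."))  # → "She finished her shift."
```

In the paper's pipeline, the augmented context generated by the large LM would be attached to each counterfactual before scoring, so that filtering operates on contextually grounded text rather than isolated word-swapped sentences.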