Large language models are becoming the go-to solution for a wide range of language tasks. However, as their capacity grows, models become prone to relying on spurious correlations that stem from biases and stereotypes in the training data. This work proposes a novel method for detecting and mitigating gender bias in language models. We perform causal analysis to identify problematic model components and find that mid-upper feed-forward layers are the most prone to conveying biases. Based on these results, we adapt the model by multiplying these layers by a linear projection. Our titular method, DAMA, significantly decreases bias as measured by diverse metrics while maintaining the model's performance on downstream tasks. We release code for our method and models, which retain LLaMA's state-of-the-art performance while being significantly less biased.
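To make the adaptation step concrete, the following is a minimal sketch of multiplying a feed-forward weight matrix by a linear projection. It is an illustration under stated assumptions, not the DAMA procedure itself: the abstract does not say how the projection is constructed, so the sketch uses a generic orthogonal nullspace projection, and the names `nullspace_projection`, `adapt_layer`, and `bias_directions`, the random data, and the toy dimensions are all hypothetical.

```python
import numpy as np


def nullspace_projection(bias_directions: np.ndarray) -> np.ndarray:
    """Return P = I - Q Q^T, which removes the span of the given columns.

    Assumption: bias_directions is a (d, k) matrix whose columns span a
    hypothetical "bias subspace" of the layer's output space.
    """
    # Orthonormalize the directions; q has shape (d, k).
    q, _ = np.linalg.qr(bias_directions)
    d = bias_directions.shape[0]
    return np.eye(d) - q @ q.T


def adapt_layer(weight: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Multiply a feed-forward output matrix (d_out, d_in) by the projection."""
    return projection @ weight


# Toy data standing in for a real layer (hypothetical sizes).
rng = np.random.default_rng(0)
d_out, d_in, k = 8, 16, 2
weight = rng.normal(size=(d_out, d_in))
bias_directions = rng.normal(size=(d_out, k))

projection = nullspace_projection(bias_directions)
adapted = adapt_layer(weight, projection)

# The adapted layer's output has no component along the bias directions.
x = rng.normal(size=d_in)
leak = bias_directions.T @ (adapted @ x)
```

Because the projection is folded into the weight matrix once, the adapted layer has the same shape as the original and adds no cost at inference time. In this sketch, `leak` is zero up to floating-point error for any input `x`.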