Large language models are becoming the go-to solution for various language tasks. However, with growing capacity, models are prone to rely on spurious correlations stemming from biases and stereotypes present in the training data. This work proposes a novel method for detecting and mitigating gender bias in language models. We perform causal analysis to identify problematic model components and discover that mid-upper feed-forward layers are most prone to convey biases. Based on the analysis results, we adapt the model by multiplying these layers by a linear projection. Our titular method, DAMA, significantly decreases bias as measured by diverse metrics while maintaining the model's performance on downstream tasks. We release code for our method and models, which retain LLaMA's state-of-the-art performance while being significantly less biased.
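To make the phrase "multiplying these layers by a linear projection" concrete, the following is a minimal sketch and not the released DAMA implementation. It assumes the bias signal can be summarized by a small set of linearly independent directions in the layer's output space, and it uses a generic orthogonal projection onto the complement of those directions; how the projection is actually derived is not described in this abstract. The names `nullspace_projection`, `adapt_layer`, and `bias_dirs` are hypothetical, and the data are random placeholders.

```python
import numpy as np


def nullspace_projection(bias_dirs: np.ndarray) -> np.ndarray:
    """Return P = I - Q Q^T for an orthonormal basis Q of span(bias_dirs).

    bias_dirs has shape (d_out, k); its columns are assumed (hypothetically)
    to be linearly independent directions that carry the bias signal.
    """
    q, _ = np.linalg.qr(bias_dirs)  # reduced QR, q has shape (d_out, k)
    return np.eye(bias_dirs.shape[0]) - q @ q.T


def adapt_layer(weight: np.ndarray, bias_dirs: np.ndarray) -> np.ndarray:
    """Left-multiply a feed-forward weight matrix (d_out, d_in) by the projection."""
    return nullspace_projection(bias_dirs) @ weight


# Toy example with random placeholders, not real model weights.
rng = np.random.default_rng(0)
d_out, d_in, k = 8, 16, 2
weight = rng.normal(size=(d_out, d_in))
bias_dirs = rng.normal(size=(d_out, k))

adapted = adapt_layer(weight, bias_dirs)

# For any input, the adapted layer's output has no component along bias_dirs.
x = rng.normal(size=d_in)
leak = bias_dirs.T @ (adapted @ x)
print(np.abs(leak).max())
```

Under these assumptions the adapted layer keeps its shape, so it can replace the original weights without changing the rest of the network, and its output is orthogonal to every assumed bias direction up to floating-point error.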