Large language models (LLMs) are pre-trained on extensive corpora to learn facts and patterns of human cognition, which embed human preferences. However, this process can inadvertently cause models to acquire the biases and stereotypes prevalent in society. Prior research has typically approached bias from a one-dimensional perspective, concentrating either on locating it or on mitigating it. This limited perspective has hindered lines of bias research from synergistically complementing and progressively building upon one another. In this study, we integrate the processes of locating and mitigating bias within a unified framework. First, we use causal mediation analysis to trace the causal effects of different components' activations within a large language model. Building on this, we propose LSDM (Least Square Debias Method), a knowledge-editing-based method for mitigating gender bias in occupational pronouns, and compare it against two baselines on three gender-bias datasets and seven knowledge-competency test datasets. The experimental results indicate that the primary contributors to gender bias are the bottom MLP modules acting on the last token of occupational pronouns and the top attention modules acting on the final token of the sentence. Furthermore, LSDM mitigates gender bias more effectively than the baselines while fully preserving the model's capabilities in all other respects.
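To make the "least squares" idea concrete, the following is a minimal NumPy sketch of a rank-one, least-squares weight edit of the kind used in knowledge editing (ROME-style): given an MLP weight matrix `W`, a key vector `k` encoding the biased association, and a debiased target value `v_star`, it finds the minimal update (in a key-covariance-weighted least-squares sense) such that the edited weights map `k` to `v_star`. The function name, the zero target, and all dimensions are illustrative assumptions, not the paper's actual implementation of LSDM.

```python
import numpy as np

def least_squares_edit(W, C, k, v_star):
    """Return W' with W' @ k == v_star, changing W minimally in the
    C-weighted least-squares sense (C ~ covariance of previously stored keys).
    Closed form: Delta = (v_star - W k) (C^{-1} k)^T / (k^T C^{-1} k)."""
    Ck = np.linalg.solve(C, k)                     # C^{-1} k
    delta = np.outer(v_star - W @ k, Ck) / (k @ Ck)
    return W + delta

rng = np.random.default_rng(0)
d_in, d_out = 8, 4
W = rng.normal(size=(d_out, d_in))                 # toy MLP projection
K = rng.normal(size=(d_in, 100))                   # keys the layer already stores
C = K @ K.T / K.shape[1]                           # key covariance (assumed invertible)
k = rng.normal(size=d_in)                          # key for the biased association
v_star = np.zeros(d_out)                           # illustrative debiased target value

W_new = least_squares_edit(W, C, k, v_star)
print(np.allclose(W_new @ k, v_star))              # the edit now maps k to v_star
```

Because the update is rank-one and weighted by the inverse key covariance, it perturbs the layer's outputs for other, unrelated keys as little as possible — the property that lets an editing-based debias method preserve the model's other capabilities.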