Transformer-based pretrained large language models (PLMs) such as BERT and GPT have achieved remarkable success in NLP tasks. However, PLMs are prone to encoding stereotypical biases. Although a growing literature addresses the mitigation of stereotypical bias in PLMs, including work on debiasing gender and racial stereotypes, how such biases manifest and behave inside PLMs remains largely unknown. Understanding these internal stereotyping mechanisms may allow better assessment of model fairness and guide the development of effective mitigation strategies. In this work, we focus on attention heads, a major component of the Transformer architecture, and propose a bias analysis framework to explore and identify a small set of biased heads found to contribute to a PLM's stereotypical bias. We conduct extensive experiments to validate the existence of these biased heads and to better understand how they behave. We investigate gender and racial bias in English in two types of Transformer-based PLMs: the encoder-based BERT model and the decoder-based autoregressive GPT model. Overall, the results shed light on bias behavior in pretrained language models.