Language models have demonstrated strong performance on a variety of natural language understanding tasks. Like humans, language models can also carry biases learned from their training data. As more and more downstream tasks integrate language models into their pipelines, it becomes necessary to understand their internal stereotypical representations and the methods available to mitigate their negative effects. In this paper, we propose a simple method that uses counterexamples to test the internal stereotypical representation in pre-trained language models. We focus mainly on gender bias, but the method can be extended to other types of bias. We evaluate models on 9 different cloze-style prompts, each consisting of a knowledge prompt and a base prompt. Our results indicate that pre-trained language models show a certain amount of robustness to unrelated knowledge, and that they respond more readily to shallow linguistic cues, such as word position and syntactic structure, when their internal stereotypical representation is altered. These findings shed light on how to manipulate language models in a neutral way for both fine-tuning and evaluation.
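As an illustration of the prompt setup described above, the following is a minimal sketch of how a knowledge prompt (here, a counterexample) might be combined with a base cloze prompt, and how a preference between two gendered fillers could be summarized. The template sentences, the `build_prompt` and `gender_gap` helpers, and the normalized-difference score are all assumptions made for illustration; they are not the paper's actual 9 prompts or its metric. The probabilities passed to `gender_gap` would come from whatever masked language model is being tested.

```python
from typing import Dict

MASK = "[MASK]"  # mask token placeholder; the real token depends on the model


def build_prompt(base: str, knowledge: str = "", knowledge_first: bool = True) -> str:
    """Join an optional knowledge sentence with a base cloze prompt.

    `knowledge_first` controls word position, one of the shallow cues
    mentioned in the abstract.
    """
    if not knowledge:
        return base
    parts = [knowledge, base] if knowledge_first else [base, knowledge]
    return " ".join(parts)


def gender_gap(probs: Dict[str, float], male: str = "he", female: str = "she") -> float:
    """Normalized difference between two candidate fillers (hypothetical score).

    Returns a value in [-1, 1]; 0 means no preference between the two.
    """
    p_m = probs.get(male, 0.0)
    p_f = probs.get(female, 0.0)
    total = p_m + p_f
    return 0.0 if total == 0 else (p_m - p_f) / total


# Hypothetical example sentences, not taken from the paper.
base = f"The nurse said that {MASK} would be late."
counterexample = "The nurse is a man."
prompt = build_prompt(base, counterexample)
```

Comparing `gender_gap` on the base prompt alone against the same score on the prompt with the counterexample prepended or appended would give a rough measure of how much the added knowledge shifts the model's preference.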