Recently, language models have demonstrated strong performance on a variety of natural language understanding tasks. Language models trained on large human-generated corpora encode not only a significant amount of human knowledge but also human stereotypes. As more and more downstream tasks integrate language models into their pipelines, it is necessary to understand these internal stereotypical representations in order to design methods for mitigating their negative effects. In this paper, we use counterexamples to examine the internal stereotypical knowledge in pre-trained language models (PLMs) that can lead to stereotypical preferences. We mainly focus on gender stereotypes, but the method can be extended to other types of stereotypes. We evaluate 7 PLMs on 9 types of cloze-style prompts with different information and base knowledge. The results indicate that PLMs show a certain amount of robustness against unrelated information and a preference for shallow linguistic cues, such as word position and syntactic structure, but lack the ability to interpret information by its meaning. These findings shed light on how to interact with PLMs in a neutral manner for both finetuning and evaluation.