Pretrained language models (PLMs) are key components in NLP, but they contain strong social biases. Quantifying these biases is challenging because current methods that rely on fill-the-mask objectives are sensitive to slight changes in input. To address this, we propose LABDet, a robust language-agnostic method for evaluating bias in PLMs. Using nationality as a case study, we show that LABDet "surfaces" nationality bias by training a sentiment classifier on top of a frozen PLM, using sentiment detection data that contains no nationality terms. Collaborating with political scientists, we find consistent patterns of nationality bias across monolingual PLMs in six languages that align with historical and political context. We also show for English BERT that bias surfaced by LABDet correlates well with bias in the pretraining data; thus, our work is one of the few studies that directly links pretraining data to PLM behavior. Finally, we verify LABDet's reliability and applicability to different templates and languages through an extensive set of robustness checks.
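The idea of surfacing bias with a classifier trained on top of a frozen encoder can be illustrated with a minimal toy sketch. This is not the LABDet implementation. The two-dimensional token vectors stand in for a frozen PLM, the templates are invented for illustration, and the group names `xlandian` and `ylandian` are hypothetical placeholders. The sketch only shows the two-step logic: train a sentiment head on sentences that mention no group, then compare its predictions on neutral sentences that differ only in the group term.

```python
import math

# Toy stand-in for a frozen PLM: fixed 2-d token vectors that are never updated.
# All tokens, vectors, and group names below are hypothetical.
FROZEN_EMB = {
    "this": (0.0, 0.1), "person": (0.0, 0.1), "is": (0.0, 0.1),
    "good": (1.0, 0.2), "great": (0.9, 0.2),
    "bad": (-1.0, 0.2), "awful": (-0.9, 0.2),
    "neutral": (0.0, 0.0),
    "xlandian": (0.4, 0.3),   # the toy encoder happens to place this nearer "good"
    "ylandian": (-0.4, 0.3),  # and this nearer "bad"
}


def encode(sentence):
    """Mean-pool the frozen token vectors (stand-in for PLM sentence features)."""
    vecs = [FROZEN_EMB[tok] for tok in sentence.lower().split()]
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(2))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(examples, epochs=500, lr=1.0):
    """Fit a logistic-regression head with SGD; the encoder stays untouched."""
    w, b = [0.0, 0.0], 0.0
    feats = [(encode(s), y) for s, y in examples]
    for _ in range(epochs):
        for x, y in feats:
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            w[0] -= lr * err * x[0]
            w[1] -= lr * err * x[1]
            b -= lr * err
    return w, b


def positive_prob(sentence, head):
    """Probability that the trained head assigns to the positive class."""
    w, b = head
    x = encode(sentence)
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)


# Step 1: the sentiment training data mentions no group at all.
train = [
    ("this person is good", 1), ("this person is great", 1),
    ("this person is bad", 0), ("this person is awful", 0),
]
head = train_head(train)

# Step 2: probe with neutral templates that differ only in the group term.
scores = {g: positive_prob(f"this {g} person is neutral", head)
          for g in ("xlandian", "ylandian")}
print(scores)
```

Because the head never sees a group term during training, any gap between the two scores can only come from the frozen representations. In a real setting, `encode` would be replaced by features from the frozen PLM under study.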