Pretrained language models (PLMs) are key components in NLP, but they contain strong social biases. Quantifying these biases is challenging because current methods that rely on fill-the-mask objectives are sensitive to slight changes in the input. To address this, we propose LABDet, a robust and language-agnostic bias probing technique for evaluating social bias in PLMs. Taking nationality as a case study, we show that LABDet "surfaces" nationality bias by training a classifier on top of a frozen PLM on non-nationality sentiment detection. We find consistent patterns of nationality bias across monolingual PLMs in six languages, and these patterns align with historical and political context. We also show for English BERT that the bias surfaced by LABDet correlates well with bias in the pretraining data; our work is thus one of the few studies that directly link pretraining data to PLM behavior. Finally, we verify LABDet's reliability and its applicability to different templates and languages through an extensive set of robustness checks. We publicly share our code and dataset at https://github.com/akoksal/LABDet.
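The probing idea described above (train a sentiment classifier on a frozen encoder using sentences that contain no nationality terms, then score otherwise neutral sentences that do contain them) can be illustrated with a toy sketch. This is not the released LABDet code. The encoder here is a hypothetical two-dimensional embedding table standing in for a frozen PLM, `groupA` and `groupB` are placeholder identity terms whose vectors are skewed by construction, and the templates, function names, and hyperparameters are all assumptions made for illustration.

```python
import math

# Hypothetical stand-in for a frozen PLM: a fixed 2-d word-embedding table.
# "groupA" and "groupB" are placeholder identity terms; their vectors are
# deliberately skewed toward positive/negative words to mimic bias that a
# real model might absorb during pretraining.
FROZEN_EMBEDDINGS = {
    "good": (1.0, 0.2), "great": (0.9, 0.1), "kind": (0.8, 0.3),
    "bad": (-1.0, 0.2), "awful": (-0.9, 0.1), "cruel": (-0.8, 0.3),
    "groupA": (0.5, 0.9), "groupB": (-0.5, 0.9),
    "person": (0.0, 1.0), "neutral": (0.0, 0.5),
}


def encode(sentence):
    """Mean-pool the frozen word vectors; unknown words are ignored."""
    vecs = [FROZEN_EMBEDDINGS[w] for w in sentence.split() if w in FROZEN_EMBEDDINGS]
    n = max(len(vecs), 1)
    return [sum(v[i] for v in vecs) / n for i in range(2)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_sentiment_head(examples, epochs=500, lr=0.5):
    """Fit a logistic-regression head with full-batch gradient descent.

    Only the head (w, b) is updated; the embedding table is never modified.
    """
    feats = [(encode(s), y) for s, y in examples]
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in feats:
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            grad_w[0] += err * x[0]
            grad_w[1] += err * x[1]
            grad_b += err
        w = [w[i] - lr * grad_w[i] / len(feats) for i in range(2)]
        b -= lr * grad_b / len(feats)
    return w, b


def positive_score(sentence, w, b):
    """Probability of positive sentiment assigned by the trained head."""
    x = encode(sentence)
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)


# Step 1: sentiment training data with no identity terms (assumed templates).
train_examples = [
    ("this person is good", 1), ("this person is great", 1), ("this person is kind", 1),
    ("this person is bad", 0), ("this person is awful", 0), ("this person is cruel", 0),
]
w, b = train_sentiment_head(train_examples)

# Step 2: probe with neutral sentences that differ only in the identity term.
scores = {
    "baseline": positive_score("this person is neutral", w, b),
    "groupA": positive_score("this groupA person is neutral", w, b),
    "groupB": positive_score("this groupB person is neutral", w, b),
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Because the classifier never sees an identity term during training, any gap between the `groupA` and `groupB` scores can only come from the frozen representations. That is the sense in which such a probe surfaces bias rather than learning it. With a real PLM, `encode` would be replaced by the model's frozen contextual embeddings.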