Bias auditing of language models (LMs) has received considerable attention as LMs become widespread, and several benchmarks for bias auditing have been proposed. At the same time, the rapid evolution of LMs can quickly render these benchmarks obsolete. Bias auditing is further complicated by LM brittleness: when a presumably biased outcome is observed, is it due to model bias or model brittleness? We propose enlisting the models themselves to help construct bias auditing datasets that remain challenging, and introduce bias measures that distinguish between different types of model errors. First, we extend an existing bias benchmark for NLI (BBNLI) using a combination of LM-generated lexical variations, adversarial filtering, and human validation. We demonstrate that the newly created dataset, BBNLI-next, is more challenging than BBNLI: on average, BBNLI-next reduces the accuracy of state-of-the-art NLI models from 95.3% on BBNLI to a strikingly low 57.5%. Second, we employ BBNLI-next to showcase the interplay between robustness and bias: we point out shortcomings in current bias scores and propose bias measures that take into account both bias and model brittleness. Third, although BBNLI-next was designed with non-generative models in mind, we show that the new dataset also uncovers bias in state-of-the-art open-source generative LMs. Note: All datasets in this work are in English and address US-centered social biases. In the spirit of efficient NLP research, no model training or fine-tuning was performed to conduct this research. Warning: This paper contains offensive text examples.
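The adversarial-filtering step in the dataset construction pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: the NLI model is replaced by a hypothetical stand-in, and the candidate pairs are invented for demonstration. The core idea is that LM-generated lexical variants are kept only when a current model mislabels them, so the filtered set is harder by construction.

```python
# Minimal sketch of adversarial filtering for an NLI bias benchmark.
# Assumptions: `toy_nli` is a hypothetical stand-in for a pretrained NLI
# classifier, and the candidate (premise, hypothesis) pairs are illustrative.

def adversarial_filter(candidates, predict, gold_label="neutral"):
    """Keep only (premise, hypothesis) pairs the model labels incorrectly."""
    return [(p, h) for p, h in candidates if predict(p, h) != gold_label]

def toy_nli(premise, hypothesis):
    # Stand-in model with a biased shortcut: it "entails" any hypothesis
    # mentioning a stereotype-linked occupation word.
    return "entailment" if "secretary" in hypothesis else "neutral"

candidates = [
    ("A person works at the clinic.", "The secretary is a woman."),
    ("A person works at the clinic.", "The employee is tall."),
]

# Only the first pair survives: the toy model mislabels it, so it is
# a challenging example; the second pair is answered correctly and dropped.
hard_set = adversarial_filter(candidates, toy_nli)
print(len(hard_set))  # → 1
```

In the full pipeline, surviving examples would additionally pass human validation before entering BBNLI-next.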