Large language models have been shown to encode a variety of social biases, which carries the risk of downstream harms. While the impact of these biases has been recognized, prior methods for bias evaluation have been limited to binary association tests on small datasets, offering a constrained view of the nature of societal biases within language models. In this paper, we propose an original framework for probing language models for societal biases. We collect a probing dataset to analyze language models' general associations, as well as along the axes of societal categories, identities, and stereotypes. To this end, we leverage a novel perplexity-based fairness score. We curate a large-scale benchmarking dataset addressing drawbacks and limitations of existing fairness collections, expanding to a variety of different identities and stereotypes. When comparing our methodology with prior work, we demonstrate that biases within language models are more nuanced than previously acknowledged. In agreement with recent findings, we find that larger model variants exhibit a higher degree of bias. Moreover, we expose how identities expressing different religions lead to the most pronounced disparate treatments across all models.
翻译:大型语言模型已被证明编码了多种社会偏见,这带来了下游危害的风险。尽管这些偏见的影响已被认识,但先前的偏见评估方法局限于在小数据集上进行二元关联测试,对语言模型中社会偏见的本质提供了有限的视角。在本文中,我们提出了一种用于探测语言模型社会偏见的新颖框架。我们收集了一个探测数据集,用以分析语言模型的一般关联,以及沿着社会类别、身份和刻板印象的轴线进行分析。为此,我们利用了一种基于困惑度的创新公平性分数。我们策划了一个大规模基准测试数据集,解决了现有公平性数据集的缺陷和局限性,扩展到多种不同的身份和刻板印象。在将我们的方法与先前工作进行比较时,我们证明了语言模型中的偏见比以往承认的更为微妙。与近期发现一致,我们发现更大的模型变体表现出更高程度的偏见。此外,我们揭示了表达不同宗教信仰的身份如何导致所有模型中最显著的差别对待。