Large language models (LLMs) have been shown to propagate and amplify harmful stereotypes, particularly those that disproportionately affect marginalised communities. To understand the effect of these stereotypes more comprehensively, we introduce GlobalBias, a dataset of 876k sentences incorporating 40 distinct gender-by-ethnicity groups alongside descriptors typically used in bias literature, which enables us to study a broad set of stereotypes from around the world. We use GlobalBias to directly probe a suite of LLMs via perplexity, which serves as a proxy for how strongly certain stereotypes are encoded in a model's internal representations. Following this, we generate character profiles based on given names and evaluate the prevalence of stereotypes in model outputs. We find that the demographic groups associated with various stereotypes remain consistent across model likelihoods and model outputs. Furthermore, larger models consistently display higher levels of stereotypical output, even when explicitly instructed not to do so.
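The perplexity probe described above can be sketched as follows. Perplexity is the exponentiated average negative log-likelihood of a sentence's tokens; the per-token log-probabilities below are hypothetical stand-ins for values a language model would assign, not numbers from the paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-probability of the tokens."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# Hypothetical per-token log-probabilities for two template sentences.
# A lower perplexity means the model finds the sentence more likely,
# i.e. the descriptor-group pairing is more strongly represented.
stereotyped_sentence = [-1.2, -0.8, -0.5, -0.9]
neutral_sentence = [-2.1, -1.7, -1.9, -2.4]

print(perplexity(stereotyped_sentence) < perplexity(neutral_sentence))
```

In practice, the log-probabilities would come from scoring each GlobalBias sentence with the model under evaluation, and groups would be compared by which descriptors yield the lowest perplexity.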