As language models (LMs) become increasingly powerful, it is important to quantify and compare their sociodemographic biases, which have the potential to cause harm. Prior bias measurement datasets are sensitive to perturbations in their manually designed templates and are therefore unreliable. To achieve reliability, we introduce the Comprehensive Assessment of Language Model bias (CALM), a benchmark dataset for quantifying bias in LMs across three tasks. We integrate 16 existing datasets from different domains, such as Wikipedia and news articles, and filter these down to 224 templates, from which we construct a dataset of 78,400 examples. We compare the diversity of CALM with that of prior datasets on metrics such as average semantic similarity and variation in template length, and we test its sensitivity to small perturbations. We show that our dataset is more diverse and reliable than previous datasets and thus better captures the breadth of linguistic variation required to reliably evaluate model bias. We evaluate 20 large language models, including six prominent families of LMs such as Llama-2. In two LM series, OPT and Bloom, we find that models with more parameters are more biased than models with fewer parameters. We find the T0 series of models to be the least biased. Furthermore, we observe a tradeoff between gender and racial bias with increasing model size in some model series. The code is available at https://github.com/vipulgupta1011/CALM.