As language models (LMs) become increasingly powerful and widely used, it is important to quantify their sociodemographic bias and its potential for harm. Prior measures of bias are sensitive to perturbations in the templates designed to compare performance across social groups, due to factors such as low template diversity or a limited number of templates. In addition, most previous work considers only one NLP task. We introduce Comprehensive Assessment of Language Models (CALM) for robust measurement of two types of universally relevant sociodemographic bias: gender and race. CALM integrates sixteen datasets for question answering, sentiment analysis, and natural language inference. Examples from each dataset are filtered to produce 224 templates with high diversity (e.g., in length and vocabulary). We assemble 50 highly frequent person names for each of seven distinct demographic groups to generate 78,400 prompts covering the three NLP tasks. Our empirical evaluation shows that CALM bias scores are more robust than previous bias measurements and far less sensitive to perturbations in the templates, such as synonym substitution, or to random subset selection of templates. We apply CALM to 20 large language models (LLMs) and find that, for 2 language model series, models with more parameters tend to be more biased than smaller ones. The T0 series is the least biased model family among the 20 LLMs investigated here. The code is available at https://github.com/vipulgupta1011/CALM.
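The prompt count stated above follows from crossing every template with every name in every demographic group. The sketch below illustrates that expansion on toy data and checks the arithmetic. It is a minimal illustration under stated assumptions, not the CALM implementation: the `<PERSON>` placeholder, the `expand_templates` helper, and the toy templates, group labels, and names are all hypothetical and do not come from the CALM datasets or repository.

```python
from itertools import product

# Hypothetical slot marker; the real CALM templates may mark the name slot differently.
NAME_SLOT = "<PERSON>"


def expand_templates(templates, names_by_group):
    """Yield (group, name, prompt) for every template, group, and name combination."""
    for template, (group, names) in product(templates, names_by_group.items()):
        for name in names:
            yield group, name, template.replace(NAME_SLOT, name)


# Toy inputs for illustration only (not taken from CALM).
templates = [
    f"{NAME_SLOT} went to the market. Who went to the market?",
    f"The review written by {NAME_SLOT} was glowing.",
]
names_by_group = {
    "group_a": ["Alex", "Sam"],
    "group_b": ["Kim", "Lee"],
}

prompts = list(expand_templates(templates, names_by_group))
# 2 templates x 2 groups x 2 names per group = 8 prompts
assert len(prompts) == 8

# Counts reported in the abstract: 224 templates, 7 groups, 50 names per group.
n_templates, n_groups, names_per_group = 224, 7, 50
total_prompts = n_templates * n_groups * names_per_group
print(total_prompts)  # 78400
```

Because each template is instantiated with the same number of names for every group, per-group performance can be compared on identical contexts that differ only in the inserted name.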