Bias in large language models (LLMs) takes many forms, from overt discrimination to implicit stereotypes. Counterfactual bias evaluation is a widely used approach to quantifying bias; it often relies on template-based probes that explicitly state group membership, and it measures whether the outcome of a task performed by an LLM is invariant to a change of group membership. In this work, we find that template-based probes can lead to unrealistic bias measurements. For example, LLMs appear to mistakenly cast text associated with White race as negative at higher rates than for other groups. We hypothesize that this arises artificially from a mismatch between commonly unstated norms, in the form of markedness, in the pretraining text of LLMs (e.g., "Black president" vs. "president") and the templates used for bias measurement (e.g., "Black president" vs. "White president"). These findings highlight the potentially misleading impact of varying group membership through explicit mention in counterfactual bias quantification.
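The template-based counterfactual setup the abstract critiques can be sketched as follows. Note that the template string, the group terms, and the `score_sentiment` stub are hypothetical placeholders for illustration, not the paper's actual materials or model:

```python
# Minimal sketch of counterfactual, template-based bias probing:
# fill a template with explicit group terms, score each variant with
# a task model, and check whether the outcome is invariant.

TEMPLATE = "The {group} president gave a speech."
GROUPS = ["Black", "White", "Asian"]  # illustrative group terms


def make_probes(template, groups):
    """Instantiate the template once per explicitly stated group."""
    return {g: template.format(group=g) for g in groups}


def score_sentiment(text):
    # Placeholder for an LLM-based classifier returning P(negative).
    # A constant stub so the sketch runs end to end; in practice this
    # would call the model under evaluation.
    return 0.5


def bias_gap(template, groups, scorer):
    """Max pairwise difference in scores across group substitutions.

    A gap of zero means the task outcome is invariant to the change
    of group membership; a nonzero gap is read as counterfactual bias.
    """
    scores = {g: scorer(p) for g, p in make_probes(template, groups).items()}
    return max(scores.values()) - min(scores.values())
```

The abstract's point is that this comparison (e.g., "Black president" vs. "White president") differs from the markedness pattern in pretraining text (e.g., "Black president" vs. unmarked "president"), so a nonzero gap may be an artifact of explicit mention rather than genuine bias.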