Bias in large language models (LLMs) takes many forms, from overt discrimination to implicit stereotypes. Counterfactual bias evaluation is a widely used approach to quantifying bias and often relies on template-based probes that explicitly state group membership. It measures whether the outcome of a task performed by an LLM is invariant to a change in group membership. In this work, we find that template-based probes can introduce systematic distortions in bias measurements. Specifically, such probes suggest that LLMs classify text associated with the White race as negative at disproportionately elevated rates. This pattern holds across a large collection of LLMs, several diverse template-based probes, and different classification approaches. We hypothesize that it arises artificially from linguistic asymmetries in LLM pretraining data, in the form of markedness (e.g., Black president vs. president), combined with the templates used for bias measurement (e.g., Black president vs. White president). These findings highlight the need for more rigorous methodologies in counterfactual bias evaluation, so that observed disparities reflect genuine biases rather than artifacts of linguistic conventions.
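To make the measurement concrete, the following is a minimal sketch of a template-based counterfactual probe, not the procedure used in this work. The templates, the group list, the helper names (`negative_rates`, `max_gap`), and the `toy_classifier` stand-in for an LLM call are all illustrative assumptions. The sketch fills each template with each group term, classifies the result, and compares per-group negative rates; a classifier that is invariant to group membership yields a gap of zero.

```python
from typing import Callable, Dict, List

# Hypothetical templates and group terms, chosen only for illustration.
TEMPLATES = [
    "The {group} president gave a speech.",
    "My {group} neighbor was late again.",
    "A {group} doctor treated the patient.",
]
GROUPS = ["Black", "White", "Asian"]


def negative_rates(
    classify: Callable[[str], str],
    templates: List[str],
    groups: List[str],
) -> Dict[str, float]:
    """Return, per group, the fraction of filled templates labeled 'negative'."""
    rates = {}
    for group in groups:
        # Counterfactual step: same template, only the group term changes.
        labels = [classify(t.format(group=group)) for t in templates]
        rates[group] = labels.count("negative") / len(labels)
    return rates


def max_gap(rates: Dict[str, float]) -> float:
    """Largest difference in negative rate between any two groups."""
    return max(rates.values()) - min(rates.values())


def toy_classifier(text: str) -> str:
    """Stand-in for an LLM classifier (assumption): ignores the group term."""
    return "negative" if "late" in text else "neutral"


if __name__ == "__main__":
    rates = negative_rates(toy_classifier, TEMPLATES, GROUPS)
    print(rates)
    print("max gap:", max_gap(rates))
```

In practice `toy_classifier` would be replaced by a call to the model under evaluation. Note that every filled template here states the group explicitly, including the unmarked case, which is the property the abstract identifies as a possible source of measurement artifacts.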