Large language models (LLMs), trained on vast datasets, can carry biases that manifest in various forms, from overt discrimination to implicit stereotypes. One facet of bias is performance disparity across groups, which often harms underprivileged groups such as racial minorities. A common approach to quantifying bias is to use template-based bias probes, which explicitly state group membership (e.g., White) and evaluate whether the outcome of a task, such as sentiment analysis, is invariant to a change in group membership (e.g., changing White to Black). This approach is widely used in bias quantification. In this work, however, we find evidence of an overlooked consequence of using template-based probes to quantify LLM bias: text examples associated with White ethnicities appear to be classified as exhibiting negative sentiment at elevated rates. We hypothesize that this pattern arises artificially from a mismatch between the pre-training text of LLMs and the templates used to measure bias. The mismatch stems from reporting bias, that is, unstated norms that imply group membership without stating it explicitly. Our finding highlights the potentially misleading impact of varying group membership through explicit mention in bias quantification.
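The template-based probing procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the setup used in this work: the templates, the group list, and `toy_classifier` (a keyword lookup standing in for an LLM sentiment call) are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical templates; "{group}" marks where group membership is stated explicitly.
TEMPLATES: List[str] = [
    "The {group} man said the service was terrible.",
    "My {group} neighbor had a wonderful day.",
    "A {group} woman told me the film was boring.",
]
GROUPS: List[str] = ["White", "Black"]


def fill_templates(templates: List[str], groups: List[str]) -> Dict[str, List[str]]:
    """Build one set of sentences per group by substituting the group term."""
    return {g: [t.format(group=g) for t in templates] for g in groups}


def negative_rate(sentences: List[str], classify: Callable[[str], str]) -> float:
    """Fraction of sentences that the classifier labels 'negative'."""
    labels = [classify(s) for s in sentences]
    return sum(label == "negative" for label in labels) / len(labels)


def toy_classifier(sentence: str) -> str:
    """Stand-in for an LLM sentiment call (assumption): a simple keyword lookup."""
    negative_words = {"terrible", "boring"}
    words = {w.strip(".,").lower() for w in sentence.split()}
    return "negative" if words & negative_words else "positive"


def probe(
    templates: List[str], groups: List[str], classify: Callable[[str], str]
) -> Dict[str, float]:
    """Negative-sentiment rate per group over otherwise identical sentences."""
    filled = fill_templates(templates, groups)
    return {g: negative_rate(sents, classify) for g, sents in filled.items()}


rates = probe(TEMPLATES, GROUPS, toy_classifier)
# Difference in negative-sentiment rates between the two groups.
gap = rates["White"] - rates["Black"]
```

Because the toy classifier ignores the group term, the two rates are identical here by construction. With a real model substituted for `toy_classifier`, a nonzero `gap` is what such probes would report as bias. The hypothesis above is that part of such a gap may come from the explicit group mention itself, which rarely appears in that form in pre-training text.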