Evaluations of social bias in LLMs largely focus on whether models generate or imply biased content. However, as LLMs are increasingly used as judges of bias, they may exhibit social bias more subtly, in how they evaluate biased content, which current methods do not systematically capture. We call this second-order bias: social bias in an LLM's judgment about social bias. We evaluate it through a novel, philosophically grounded reasoning task. Drawing on entitlement epistemology, we conceptualize bias as misplaced foundational knowledge that shapes an agent's rational inquiry, and derive a logical reasoning task in which LLMs judge to whom a biased text is acceptable or unacceptable. We develop two simple metrics to measure how biased LLM judges are in inferring demographics for acceptability without sufficient support, and how these inferences vary across the groups targeted by biased texts. Evaluating open and closed models, we find that our task evades safety guardrails by surfacing bias in model judgment. This bias varies systematically across target groups, reflects implicit social maps, and shows that models are still triggered by demographic labels. Our work points to the need for LLM bias evaluation in judgment tasks and, more broadly, for more theoretically grounded approaches to bias evaluation in NLP. We release our code and model responses at https://github.com/uofthcdslab/second-order-bias.