Measuring Implicit Bias in Explicitly Unbiased Large Language Models

Large language models (LLMs) can pass explicit social bias tests but still harbor implicit biases, similar to humans who endorse egalitarian beliefs yet exhibit subtle biases. Measuring such implicit biases can be a challenge: as LLMs become increasingly proprietary, it may not be possible to access their embeddings and apply existing bias measures; furthermore, implicit biases are primarily a concern if they affect the actual decisions that these systems make. We address both challenges by introducing two new measures of bias: LLM Implicit Bias, a prompt-based method for revealing implicit bias; and LLM Decision Bias, a strategy to detect subtle discrimination in decision-making tasks. Both measures are based on psychological research: LLM Implicit Bias adapts the Implicit Association Test, widely used to study the automatic associations between concepts held in human minds; and LLM Decision Bias operationalizes psychological results indicating that relative evaluations between two candidates, not absolute evaluations assessing each independently, are more diagnostic of implicit biases. Using these measures, we found pervasive stereotype biases mirroring those in society in 8 value-aligned models across 4 social categories (race, gender, religion, health) in 21 stereotypes (such as race and criminality, race and weapons, gender and science, age and negativity). Our prompt-based LLM Implicit Bias measure correlates with existing language model embedding-based bias methods, but better predicts downstream behaviors measured by LLM Decision Bias. These new prompt-based measures draw from psychology's long history of research into measuring stereotype biases based on purely observable behavior; they expose nuanced biases in proprietary value-aligned LLMs that appear unbiased according to standard benchmarks.

翻译：大语言模型（LLM）能够通过显式的社会偏见测试，但仍可能隐含偏见，这与人类虽认同平等主义信念却仍表现出微妙偏见的情况相似。测量此类隐式偏见面临双重挑战：随着LLM日益专有化，可能无法获取其嵌入向量并应用现有偏见度量方法；此外，隐式偏见主要在其影响系统实际决策时才值得关注。我们通过引入两种新的偏见度量方法应对这些挑战：LLM隐式偏见——一种基于提示词揭示隐式偏见的方法；以及LLM决策偏见——一种检测决策任务中微妙歧视的策略。两种方法均基于心理学研究：LLM隐式偏见改编了广泛应用于研究人类心智中概念间自动关联的内隐联想测验；LLM决策偏见则借鉴了心理学研究成果，表明对两名候选者的相对评估（而非独立评估每人的绝对评估）更能诊断隐式偏见。运用这些方法，我们在4个社会类别（种族、性别、宗教、健康）的21种刻板印象（如种族与犯罪性、种族与武器、性别与科学、年龄与消极性）中，发现8个价值观对齐模型普遍存在反映社会现实的刻板印象偏见。我们基于提示词的LLM隐式偏见度量与现有基于语言模型嵌入的偏见方法具有相关性，但能更好地预测LLM决策偏见度量的下游行为。这些新型提示词度量方法借鉴了心理学基于纯可观测行为测量刻板印象偏见的长期研究传统，揭示了专有价值对齐LLM中那些在标准基准测试下看似无偏见实则存在的微妙偏见。