Large language models (LLMs) can pass explicit bias tests yet still harbor implicit biases, much as humans who endorse egalitarian beliefs can still exhibit subtle biases. Measuring such implicit biases is challenging for two reasons: as LLMs become increasingly proprietary, it may not be possible to access their embeddings and apply existing bias measures; and implicit biases are primarily a concern if they affect the actual decisions these systems make. We address both challenges by introducing two measures of bias inspired by psychology: LLM Implicit Association Test (IAT) Bias, a prompt-based method for revealing implicit bias, and LLM Decision Bias, which detects subtle discrimination in decision-making tasks. Using these measures, we found pervasive human-like stereotype biases in 6 LLMs across 4 social domains (race, gender, religion, health) and 21 categories (including weapons, guilt, science, and career). Our prompt-based measure of implicit bias correlates with embedding-based methods but better predicts the downstream behaviors measured by LLM Decision Bias. The LLM Decision Bias measure asks the LLM to decide between individuals, a design motivated by psychological results indicating that relative, rather than absolute, evaluations are more closely related to implicit biases. Prompt-based measures informed by psychology thus allow us to expose nuanced biases and subtle discrimination in proprietary LLMs that show no explicit bias on standard benchmarks.