Fair decisions require ignoring irrelevant, potentially biasing information. To achieve this, decision-makers must approximate the decision they would have made had they not known certain facts, such as a job candidate's gender or race. This counterfactual self-simulation is notoriously hard for humans and leads to biased judgments even by well-meaning actors. Here we show that large language models (LLMs) suffer from a similar limitation: they cannot reliably approximate the decisions they would make under counterfactual knowledge, whether the goal is offsetting gender and race biases or overcoming sycophancy. Prompting models to ignore, or to pretend not to know, biasing information fails to offset these biases and occasionally backfires. Unlike humans, however, LLMs can be given access to a ground-truth model of their own counterfactual cognition: their own API. We show that access to the responses of a blinded replica enables fairer decisions, while providing greater transparency to distinguish implicitly from intentionally biased behavior.
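As a minimal sketch of the blinded-replica mechanism described above (not the paper's actual protocol), the idea can be expressed as two calls to the same model: one over a profile with protected attributes redacted, and one that treats the blinded answer as ground truth for the counterfactual. The helper `query_llm`, the redaction step, and the prompts are all illustrative assumptions.

```python
# Illustrative sketch of the blinded-replica idea, assuming a hypothetical
# `query_llm` wrapper around any chat-completion API; redaction and prompts
# are placeholders, not the paper's protocol.
import re


def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    raise NotImplementedError("plug in your model client here")


def redact(profile: str, protected: dict[str, str]) -> str:
    """Remove protected attribute mentions (e.g. gender, race) from a profile."""
    for value in protected.values():
        profile = re.sub(re.escape(value), "[REDACTED]", profile, flags=re.IGNORECASE)
    return profile


def blinded_decision(profile: str, protected: dict[str, str]) -> str:
    # 1. Query a blinded replica: the same model, but with the potentially
    #    biasing information removed from its input.
    blinded_answer = query_llm(
        "Rate this job candidate from 1-10 and justify briefly:\n"
        + redact(profile, protected)
    )
    # 2. The deciding model receives the replica's response as ground truth
    #    for what it would have said had it not known the attributes.
    return query_llm(
        "A blinded copy of you, shown this profile without protected "
        "attributes, answered:\n"
        f"{blinded_answer}\n"
        "Adopt the blinded answer as your decision unless the redacted "
        "information is genuinely job-relevant.\n\nFull profile:\n" + profile
    )
```

Keeping the blinded call separate also makes the counterfactual auditable: any divergence between the final decision and the replica's answer is visible, rather than hidden inside a single model response.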