Drawing on constructs from psychology, prior work has identified a distinction between explicit and implicit bias in large language models (LLMs). While many LLMs undergo post-training alignment and safety procedures to avoid expressions of explicit social bias, they still exhibit significant implicit biases on indirect tasks resembling the Implicit Association Test (IAT). Recent work has further shown that inference-time reasoning can impair LLM performance on tasks that rely on implicit statistical learning. Motivated by a theoretical link between implicit associations and statistical learning in human cognition, we examine how reasoning-enabled inference affects implicit bias in LLMs. We find that enabling reasoning significantly reduces measured implicit bias on an IAT-style evaluation for some model classes across fifteen stereotype topics. This effect appears specific to social bias domains, as we observe no corresponding reduction for non-social implicit associations. As reasoning is increasingly enabled by default in deployed LLMs, these findings suggest that it can meaningfully alter fairness evaluation outcomes in some systems, while also raising questions about how alignment procedures interact with inference-time reasoning to drive variation in bias reduction across model types. More broadly, this work highlights how theory from cognitive science and psychology can complement AI evaluation research by providing methodological and interpretive frameworks that reveal new insights into model behavior.