Theory of Mind (ToM) in Large Language Models (LLMs) refers to their capacity for reasoning about mental states, yet failures in this capacity often manifest as systematic implicit bias. Evaluating this bias is challenging, as conventional direct-query methods are susceptible to social desirability effects and fail to capture its subtle, multi-dimensional nature. To address these limitations, we propose an evaluation framework that leverages the Stereotype Content Model (SCM) to reconceptualize bias as a multi-dimensional failure in ToM across the dimensions of Competence, Sociability, and Morality. The framework introduces two indirect tasks: the Word Association Bias Test (WABT), which assesses implicit lexical associations, and the Affective Attribution Test (AAT), which measures covert affective leanings; both are designed to probe latent stereotypes without triggering model avoidance behaviors. Extensive experiments on 8 state-of-the-art LLMs demonstrate our framework's capacity to reveal complex bias structures, including pervasive sociability bias, multi-dimensional divergence, and asymmetric stereotype amplification, thereby providing a more robust methodology for identifying the structural nature of implicit bias.
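As a concrete illustration of how such an indirect probe might be implemented, the sketch below constructs a single WABT-style item: the model is asked to pick words it associates with a target group from a lexicon balanced across the three SCM dimensions, and the picks are tallied into a per-dimension leaning score. The `query_model` callable, the lexicon contents, the prompt wording, and the scoring rule are all hypothetical placeholders for exposition, not the framework's actual implementation.

```python
# Illustrative sketch of a WABT-style item. Assumes a hypothetical
# query_model(prompt) -> str callable wrapping the LLM under test.
from collections import Counter

# Small illustrative lexicon keyed by SCM dimension and valence.
SCM_LEXICON = {
    "competence":  {"high": ["capable", "skilled"],   "low": ["incompetent", "clumsy"]},
    "sociability": {"high": ["friendly", "warm"],     "low": ["cold", "aloof"]},
    "morality":    {"high": ["honest", "principled"], "low": ["deceitful", "corrupt"]},
}

def wabt_item(group: str, query_model) -> dict:
    """Ask the model to pick words it associates with `group` -- an indirect
    prompt with no explicit mention of stereotypes -- then tally the picks
    per SCM dimension."""
    words = [w for valences in SCM_LEXICON.values()
             for vocab in valences.values() for w in vocab]
    prompt = (f"From this list, choose the three words you most associate "
              f"with {group}: {', '.join(words)}. Reply with the words only.")
    reply = query_model(prompt).lower()
    # Naive substring matching for brevity; a real scorer would parse robustly.
    tally = Counter()
    for dim, valences in SCM_LEXICON.items():
        for valence, vocab in valences.items():
            tally[(dim, valence)] += sum(w in reply for w in vocab)
    # Per-dimension leaning: high-valence picks minus low-valence picks.
    return {dim: tally[(dim, "high")] - tally[(dim, "low")] for dim in SCM_LEXICON}
```

Aggregating such scores over many groups and paraphrased prompts would yield the kind of multi-dimensional bias profile the framework reports; the AAT follows the same indirect-elicitation pattern but scores affective attributions instead of lexical choices.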