Theory of Mind (ToM) in Large Language Models (LLMs) refers to a model's ability to infer the mental states of others; failures of this ability often manifest as systematic implicit biases. Assessing such biases is difficult, because traditional direct-inquiry methods are frequently met with refusals to answer and fail to capture their subtle, multidimensional nature. We therefore propose MIST, a framework that reconceptualizes the Stereotype Content Model as a set of multidimensional ToM failures, specifically in the domains of competence, sociability, and morality. The framework introduces two indirect tasks: the Word Association Bias Test (WABT), which assesses implicit lexical associations, and the Affective Attribution Test (AAT), which measures implicit emotional tendencies, both designed to uncover latent stereotypes without triggering model avoidance. Extensive experiments on eight state-of-the-art LLMs show that the framework reveals complex bias structures and offers improved robustness over direct inquiry. All data and code will be released.