SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs

Large language models (LLMs) are increasingly tested for "Theory of Mind" (ToM), the ability to attribute mental states to oneself and others. Yet most evaluations stop at explicit belief attribution in classical toy stories or stylized tasks, leaving open whether LLMs can implicitly apply such knowledge to predict human behavior, or to judge an observed behavior, in diverse scenarios. We introduce SimpleToM, a benchmark that advances ToM evaluation along two novel axes. First, it probes multiple levels of ToM reasoning, from mental state inference (explicit ToM) to behavior prediction and judgment (applied ToM). Second, it situates these tasks in diverse, everyday scenarios, such as supermarkets, hospitals, schools, and offices, where information asymmetries naturally arise (e.g., hidden defects in grocery store items, incomplete information in provider-patient interactions, or restricted access to locked devices). SimpleToM contains concise stories (e.g., "The can of Pringles has moldy chips in it. Mary picks up the can in the supermarket and walks to the cashier."), each with three questions that test different degrees of ToM reasoning, asking models to predict: (a) mental states ("Is Mary aware of the mold?"), (b) behaviors ("Will Mary pay for the chips or report the mold?"), and (c) judgments ("Mary paid for the chips. Was that reasonable?"). Experiments reveal a striking gap: state-of-the-art models often reliably infer mental states (a), but fail to apply that knowledge in secondary predictions, with performance dropping sharply on behavior prediction (b) and further on behavior judgment (c). This exposes a critical fragility in LLMs' social reasoning: a gap between what they know (explicit ToM) and how well they can implicitly apply that knowledge for predictions (applied ToM).
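To make the three-question structure concrete, the following is a minimal Python sketch of how one story and its questions might be represented and scored per question type. It is not the benchmark's actual data format or evaluation code: the names `Question`, `Story`, `accuracy_by_kind`, and `always_first_option`, the answer options, and the gold labels are all assumptions inferred from the Pringles example above. The `always_first_option` predictor is a trivial stand-in for a model call and its scores are not experimental results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    kind: str       # "mental_state" (a), "behavior" (b), or "judgment" (c)
    text: str
    options: tuple  # candidate answers (assumed format)
    gold: str       # assumed gold answer


@dataclass(frozen=True)
class Story:
    text: str
    questions: tuple


def accuracy_by_kind(stories, predict):
    """Return accuracy per question kind for a predict(story_text, question) callable."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for story in stories:
        for q in story.questions:
            total[q.kind] += 1
            if predict(story.text, q) == q.gold:
                correct[q.kind] += 1
    return {kind: correct[kind] / total[kind] for kind in total}


# The example story from the abstract; options and gold labels are assumptions.
pringles = Story(
    text=("The can of Pringles has moldy chips in it. "
          "Mary picks up the can in the supermarket and walks to the cashier."),
    questions=(
        Question("mental_state", "Is Mary aware of the mold?",
                 ("Yes", "No"), "No"),
        Question("behavior", "Will Mary pay for the chips or report the mold?",
                 ("pay for the chips", "report the mold"), "pay for the chips"),
        Question("judgment", "Mary paid for the chips. Was that reasonable?",
                 ("reasonable", "not reasonable"), "reasonable"),
    ),
)


def always_first_option(story_text, question):
    # Placeholder predictor: replace with a real model call.
    return question.options[0]


scores = accuracy_by_kind([pringles], always_first_option)
```

Reporting accuracy separately for each question kind, rather than as one pooled number, is what allows a gap between mental state inference (a) and the applied questions (b) and (c) to show up.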
