Theory of Mind (ToM) assesses whether models can infer hidden mental states such as beliefs, desires, and intentions, a capability essential for natural social interaction. Although recent progress in Large Reasoning Models (LRMs) has improved step-by-step inference in mathematics and coding, it remains underexplored whether these gains transfer to socio-cognitive skills. We present a systematic study of nine advanced Large Language Models (LLMs), comparing reasoning models with non-reasoning models on three representative ToM benchmarks. The results show that reasoning models do not consistently outperform non-reasoning models and sometimes perform worse. A fine-grained analysis reveals three insights. First, slow thinking collapses: accuracy drops significantly as responses grow longer, and larger reasoning budgets hurt performance. Second, moderate and adaptive reasoning helps: constraining reasoning length mitigates these failures, while distinct success patterns demonstrate the necessity of dynamic adaptation. Third, reasoning models exploit an option-matching shortcut: when multiple-choice options are removed, they improve markedly, indicating reliance on option matching rather than genuine deduction. We further design two interventions, Slow-to-Fast (S2F) adaptive reasoning and Think-to-Match (T2M) shortcut prevention, to verify and mitigate these problems. Taken together, our results highlight that the advances of LRMs in formal reasoning (e.g., math, code) do not fully transfer to ToM, a representative social-reasoning task. We conclude that achieving robust ToM requires developing capabilities beyond existing reasoning methods.
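The "slow thinking collapses" finding rests on stratifying accuracy by response length. A minimal sketch of such an analysis is shown below; the record format, bin edges, and toy data are illustrative assumptions, not artifacts from the study itself.

```python
# Hypothetical sketch: bin (response_length, correct) records by length
# and compute per-bin accuracy. Bin edges and data are illustrative.
from statistics import mean

def accuracy_by_length(records, bin_edges=(0, 200, 500, 1000, float("inf"))):
    """Group (response_length, correct) records into length bins and
    return per-bin accuracy, or None for empty bins."""
    bins = {f"[{lo},{hi})": [] for lo, hi in zip(bin_edges, bin_edges[1:])}
    for length, correct in records:
        for lo, hi in zip(bin_edges, bin_edges[1:]):
            if lo <= length < hi:
                bins[f"[{lo},{hi})"].append(correct)
                break
    return {k: (mean(v) if v else None) for k, v in bins.items()}

# Toy data mimicking the reported trend: longer responses, lower accuracy.
records = [(120, 1), (150, 1), (300, 1), (420, 0), (800, 0), (950, 0)]
print(accuracy_by_length(records))
```

A monotonically decreasing accuracy across the bins (here 1.0, 0.5, 0.0 on the toy data) is the kind of pattern the paper summarizes as slow-thinking collapse.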