Theory of Mind (ToM) is a critical component of intelligence, but its assessment remains the subject of heated debate. Prior research applied human ToM assessments to natural language processing models using either human-created standardized tests or rule-based templates. However, these methods primarily focus on simplistic reasoning and require further validation. Here, we leverage dynamic epistemic logic to isolate a particular component of ToM and to generate controlled problems. We also introduce new verbalization techniques to express these problems in natural English. Our findings indicate that scaling some language models (from 70M to 6B and from 350M to 174B parameters) does not consistently yield results better than random chance. While GPT-4 demonstrates superior epistemic reasoning capabilities, there is still room for improvement. Our code and datasets are publicly available (https://huggingface.co/datasets/sileod/mindgames, https://github.com/sileod/llm-theory-of-mind).
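To make the role of dynamic epistemic logic concrete, the following is a minimal sketch of a public announcement updating what an agent knows. It is an illustration only and not the generator from the repository above: the two-agent setting (a reduced form of the classic muddy children puzzle), the helper names `make_worlds`, `knows`, and `announce`, and the atoms `muddy_alice` and `muddy_bob` are all assumptions introduced here.

```python
from itertools import product

def make_worlds(props):
    """Enumerate every truth assignment; a world is the frozenset of true atoms."""
    return [frozenset(p for p, v in zip(props, vals) if v)
            for vals in product([False, True], repeat=len(props))]

def knows(worlds, observes, agent, fact, actual):
    """An agent knows `fact` at `actual` if `fact` holds in every world the
    agent cannot tell apart from `actual`, i.e. every remaining world that
    agrees with `actual` on the atoms the agent observes."""
    visible = observes[agent]
    accessible = [w for w in worlds if w & visible == actual & visible]
    return all(fact(w) for w in accessible)

def announce(worlds, fact):
    """Public announcement: discard every world where `fact` is false."""
    return [w for w in worlds if fact(w)]

# Hypothetical setting: each agent sees the other's forehead but not their own.
props = ["muddy_alice", "muddy_bob"]
observes = {
    "alice": frozenset({"muddy_bob"}),
    "bob": frozenset({"muddy_alice"}),
}
worlds = make_worlds(props)
actual = frozenset({"muddy_alice"})  # in the actual world only Alice is muddy

def alice_is_muddy(world):
    return "muddy_alice" in world

# Before any announcement, Alice cannot rule out the world where nobody is muddy.
before = knows(worlds, observes, "alice", alice_is_muddy, actual)

# Public announcement: "at least one of you is muddy".
worlds_after = announce(worlds, lambda w: len(w) >= 1)
after = knows(worlds_after, observes, "alice", alice_is_muddy, actual)

print(before, after)
```

A controlled problem in this style pairs a premise (who observes what, and what was announced) with a hypothesis such as "Alice knows that she is muddy", whose truth value is computed from the model rather than written by hand.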