Theory of Mind (ToM) is a critical component of intelligence, yet how to measure it accurately remains a subject of debate. Prior research has attempted to apply human ToM assessments to natural language processing models using either human-created standardized tests or rule-based templates. However, these methods primarily focus on simplistic reasoning and require further validation. In this study, we utilize dynamic epistemic logic, which has established overlaps with ToM, to generate more intricate problems. We also introduce novel verbalization techniques to express these problems in natural language. Our findings indicate that scaling certain language models (from 70M to 6B and from 350M to 174B parameters) does not consistently yield results better than random chance. While GPT-4 demonstrates improved epistemic reasoning capabilities, there is still room for improvement. Our code and datasets are publicly available at https://github.com/antoinelrnld/modlog and https://huggingface.co/datasets/sileod/mindgames.