Large Language Models (LLMs) have recently shown promise and emergent Theory of Mind (ToM) abilities, even outperforming humans on certain ToM tasks. To evaluate and extend the boundaries of the ToM reasoning ability of LLMs, we propose a novel concept, taxonomy, and framework, ToM reasoning with Zero, Finite, and Infinite Belief History, and develop a multi-round text-based game, $\textit{Pick the Right Stuff}$, as a benchmark. We evaluated six LLMs with this game and found that their performance on Zero Belief History is consistently better than on Finite Belief History. In addition, two of the evaluated models with small parameter sizes outperformed all of the evaluated models with large parameter sizes. We expect this work to pave the way for future ToM benchmark development and to promote the design of more complex AI agents and systems that must be equipped with more sophisticated ToM reasoning ability.