Theory of Mind (ToM), the capacity to understand the mental states of different individuals, is essential for numerous practical applications. With the development of large language models, there is heated debate about whether they can perform ToM tasks. Previous studies have used different tasks and prompts to test ToM in large language models, and the results are inconsistent: some studies assert that these models exhibit ToM, while others suggest the opposite. In this study, we present ToMChallenges, a dataset for comprehensively evaluating Theory of Mind based on the Sally-Anne and Smarties tests. We created 30 variations of each test (e.g., changing the person's name, location, and items). For each variation, we test the model's understanding of different aspects: reality, belief, 1st order belief, and 2nd order belief. We adapt our data for various tasks by creating unique prompts tailored to each task category: Fill-in-the-Blank, Multiple Choice, True/False, Chain-of-Thought True/False, Question Answering, and Text Completion. If a model has robust ToM, it should perform well across different prompts and different tests. We evaluated two GPT-3.5 models, text-davinci-003 and gpt-3.5-turbo-0301, on our dataset. Our results indicate that consistent performance on ToM tasks remains a challenge.
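To make the construction described above concrete, the following is a minimal sketch of how templated variations of a Sally-Anne style story could be generated and rendered into more than one prompt format. It is not the authors' code: the `Scenario` class, the names, items, and places, the question wording, and the two prompt templates are all hypothetical. Only three of the probed aspects are shown, because the abstract does not define how the separate "belief" category differs from 1st order belief, and only two of the six task formats are illustrated.

```python
from dataclasses import dataclass
from itertools import islice, product


@dataclass(frozen=True)
class Scenario:
    """One unexpected-transfer (Sally-Anne style) story; all fields are illustrative."""
    agent_a: str  # leaves the room, so ends up holding a false belief
    agent_b: str  # moves the item while agent_a is away
    item: str
    place_1: str  # where agent_a puts the item
    place_2: str  # where agent_b moves it

    def story(self) -> str:
        return (
            f"{self.agent_a} and {self.agent_b} are in a room. "
            f"{self.agent_a} puts the {self.item} in the {self.place_1} and leaves. "
            f"While {self.agent_a} is away, {self.agent_b} moves the {self.item} "
            f"to the {self.place_2}."
        )

    def questions(self) -> dict[str, tuple[str, str]]:
        """Map each probed aspect to a (question, expected answer) pair."""
        return {
            "reality": (f"Where is the {self.item} now?", self.place_2),
            "first_order_belief": (
                f"Where will {self.agent_a} look for the {self.item}?",
                self.place_1,
            ),
            "second_order_belief": (
                f"Where does {self.agent_b} think {self.agent_a} "
                f"will look for the {self.item}?",
                self.place_1,
            ),
        }


def make_variations(n: int = 30) -> list[Scenario]:
    """Build n distinct scenarios by varying names, items, and places (assumed lists)."""
    agents = [("Sally", "Anne"), ("Liam", "Noah"), ("Emma", "Olivia")]
    items = ["marble", "key", "coin", "ring", "ball"]
    places = [("basket", "box"), ("drawer", "bag")]
    combos = product(agents, items, places)  # 3 * 5 * 2 = 30 combinations
    return [
        Scenario(a, b, item, p1, p2)
        for (a, b), item, (p1, p2) in islice(combos, n)
    ]


def render_prompts(scenario: Scenario, aspect: str) -> dict[str, str]:
    """Render one scenario and aspect into two hypothetical prompt formats."""
    question, _answer = scenario.questions()[aspect]
    base = f"{scenario.story()}\nQuestion: {question}\n"
    return {
        "question_answering": base + "Answer:",
        "multiple_choice": (
            base
            + f"A. the {scenario.place_1}\n"
            + f"B. the {scenario.place_2}\n"
            + "Answer:"
        ),
    }


if __name__ == "__main__":
    variations = make_variations()
    for name, prompt in render_prompts(variations[0], "first_order_belief").items():
        print(f"--- {name} ---\n{prompt}\n")
```

Under this setup, the consistency the abstract asks for amounts to a model giving the expected answer for the same scenario and aspect regardless of which prompt format is used.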