Theory of Mind (ToM), the ability to reason about the mental states of oneself and others, is a cornerstone of human social intelligence. As Large Language Models (LLMs) become ubiquitous in real-world applications, validating their capacity for this level of social reasoning is essential for effective and natural interactions. However, existing benchmarks for assessing ToM in LLMs are limited: most rely solely on text inputs and focus narrowly on belief-related tasks. In this paper, we propose CoMMET (Comprehensive Mental states and Moral Evaluation Task), a new multimodal benchmark dataset inspired by the Theory of Mind Booklet Task. CoMMET expands the scope of evaluation by covering a broader range of mental states and introducing multi-turn testing. To the best of our knowledge, this is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive assessment of LLMs across different families and sizes, we analyze the strengths and limitations of current models and identify directions for future improvement. Our work offers a deeper understanding of the social cognitive capabilities of modern LLMs.