Theory of Mind (ToM), the ability to understand people's minds, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets, either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data, which can include visual cues, linguistic narratives, or both. To address this gap, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and uses language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results by combining model-based mental inference with language models.
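To make the idea of Bayesian inverse planning concrete, the following is a minimal toy sketch, not the BIP-ALM implementation. It scores hypotheses about a person's goal and belief by how well each one explains the observed actions, then normalizes the scores into a posterior. All names here (`ACTIONS`, `action_likelihood`, `posterior`) and the probability values are illustrative assumptions; in particular, `action_likelihood` is a hand-written stand-in for the policy that, according to the abstract, BIP-ALM obtains with the help of a language model.

```python
import math

# Hypothetical action space for a toy household scene (an assumption).
ACTIONS = ("walk_to_fridge", "walk_to_cabinet", "walk_to_table")


def action_likelihood(action, goal, belief):
    """Toy stand-in for a policy pi(action | goal, belief).

    Assumption: the agent mostly walks to where it *believes* its goal
    object is. A language model could supply this probability instead.
    """
    believed_location = belief[goal]
    if action == f"walk_to_{believed_location}":
        return 0.8
    # Spread the remaining probability mass over the other actions.
    return 0.2 / (len(ACTIONS) - 1)


def posterior(actions, hypotheses):
    """Posterior over (goal, belief) hypotheses given observed actions.

    Uses a uniform prior and works in log space for numerical stability.
    """
    log_prior = -math.log(len(hypotheses))
    log_scores = []
    for goal, belief in hypotheses:
        log_p = log_prior
        for action in actions:
            log_p += math.log(action_likelihood(action, goal, belief))
        log_scores.append(log_p)
    # Normalize with the log-sum-exp trick.
    max_score = max(log_scores)
    weights = [math.exp(s - max_score) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


# Each hypothesis pairs a goal with a belief about where that object is.
hypotheses = [
    ("apple", {"apple": "fridge"}),
    ("apple", {"apple": "cabinet"}),
    ("book", {"book": "cabinet"}),
]
observed = ["walk_to_fridge", "walk_to_fridge"]
probs = posterior(observed, hypotheses)
for (goal, belief), p in zip(hypotheses, probs):
    print(goal, belief, round(p, 3))
```

In this toy setting, repeatedly walking to the fridge is best explained by the hypothesis that the person wants the apple and believes it is in the fridge, so that hypothesis receives most of the posterior mass. The exhaustive loop over hypotheses is only feasible because the hypothesis space here is tiny.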