We introduce EgoToM, a new video question-answering benchmark that extends Theory-of-Mind (ToM) evaluation to egocentric domains. Using a causal ToM model, we generate multiple-choice video QA instances for the Ego4D dataset to benchmark the ability to predict a camera wearer's goals, beliefs, and next actions. We study the performance of both humans and state-of-the-art multimodal large language models (MLLMs) on these three interconnected inference problems. Our evaluation shows that MLLMs achieve close to human-level accuracy at inferring goals from egocentric videos. However, MLLMs (including the largest ones we tested, with over 100B parameters) fall short of human performance when inferring the camera wearer's in-the-moment belief states and the future actions that are most consistent with the unseen video future. We believe our results will shape the future design of an important class of egocentric digital assistants equipped with a reasonable model of the user's internal mental states.