As a foundational component of cognitive intelligence, theory of mind (ToM) can bring AI closer to human thought processes, thereby enhancing its interaction and collaboration with humans. In particular, it can significantly improve a model's comprehension of videos in complex scenes. However, current video question answering (VideoQA) datasets focus on causal reasoning within events, and few of them genuinely incorporate human ToM. Consequently, ToM reasoning tasks remain underdeveloped in VideoQA. This paper presents BDIQA, the first benchmark to explore the cognitive reasoning capabilities of VideoQA models in the context of ToM. BDIQA is inspired by the cognitive development of children's ToM and addresses the current deficiencies in machine ToM within datasets and tasks. Specifically, it offers tasks at two difficulty levels, assessing Belief, Desire, and Intention (BDI) reasoning in both simple and complex scenarios. We evaluate several mainstream VideoQA methods and diagnose their capabilities under zero-shot, few-shot, and supervised learning. We find that the performance of pre-trained models on cognitive reasoning tasks remains unsatisfactory. To address this challenge, we conduct thorough analysis and experimentation, and present two guidelines, derived from ablation analysis, for enhancing cognitive reasoning.