We propose MToMnet, a Theory of Mind (ToM) neural network that predicts beliefs and their dynamics during human social interactions from multimodal input. ToM is key to effective nonverbal human communication and collaboration, yet existing methods for belief modelling either lack explicit ToM modelling or are typically limited to one or two modalities. MToMnet encodes contextual cues (scene videos and object locations) and integrates them with person-specific cues (human gaze and body language) in a separate MindNet for each person. Inspired by prior research on social cognition and computational ToM, we propose three MToMnet variants: two that fuse latent representations and one that re-ranks classification scores. We evaluate our approach on two challenging real-world datasets, one focused on belief prediction and the other on belief dynamics prediction. Our results show that MToMnet outperforms existing methods by a large margin while requiring significantly fewer parameters. Taken together, our method opens up a highly promising direction for future work on artificially intelligent systems that can robustly predict human beliefs from nonverbal behaviour and, as such, collaborate more effectively with humans.
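The two integration strategies named above can be illustrated with a deliberately simplified sketch. All names here (`encode`, `fuse_latents`, `rerank_scores`) are hypothetical stand-ins, not the paper's implementation: the real MindNets are learned neural encoders, whereas these toy functions use element-wise arithmetic purely to show the structural difference between fusing latent representations and re-ranking classification scores.

```python
# Toy sketch of the two MToMnet integration strategies described above.
# Hypothetical names; the actual model uses learned neural encoders.

def encode(context, person_cues):
    """Toy 'MindNet' encoder: combine contextual cues with one
    person's cues into a latent vector (element-wise sum here)."""
    return [c + p for c, p in zip(context, person_cues)]

def fuse_latents(latent_a, latent_b):
    """Latent-fusion variants: merge the two persons' latent
    representations (element-wise mean here)."""
    return [(a + b) / 2 for a, b in zip(latent_a, latent_b)]

def rerank_scores(scores_a, scores_b):
    """Score re-ranking variant: combine per-person classification
    scores and pick the jointly highest-scoring class."""
    combined = [a + b for a, b in zip(scores_a, scores_b)]
    return max(range(len(combined)), key=combined.__getitem__)

context = [1.0, 0.0]
latent_a = encode(context, [0.5, 0.5])   # -> [1.5, 0.5]
latent_b = encode(context, [0.0, 1.0])   # -> [1.0, 1.0]
fused = fuse_latents(latent_a, latent_b)  # -> [1.25, 0.75]

# Person A favours class 1, person B favours class 2; jointly, class 1 wins.
belief_class = rerank_scores([0.2, 0.7, 0.1], [0.3, 0.3, 0.4])  # -> 1
```

The key structural point is where the per-person streams meet: the fusion variants merge information before classification, while the re-ranking variant keeps each person's classifier independent and only combines their output scores.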