Understanding social interactions involving both verbal and non-verbal cues is essential for interpreting social situations effectively. However, most prior work on multimodal social cues focuses on single-person behaviors or relies on holistic visual representations that are not aligned to utterances in multi-party environments, and is therefore limited in modeling the intricate dynamics of multi-party interactions. In this paper, we introduce three new challenging tasks that model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction. We contribute extensive data annotations to curate these new challenges in social deduction game settings. Furthermore, we propose a novel multimodal baseline that leverages densely aligned language-visual representations, synchronizing visual features with their corresponding utterances so that verbal and non-verbal cues pertinent to social reasoning are captured concurrently. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions. Project website: https://sangmin-git.github.io/projects/MMSI.
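To make the dense alignment idea concrete, the sketch below illustrates one plausible reading of "synchronizing visual features with their corresponding utterances": for each utterance, only the visual features of frames within that utterance's time span are pooled and fused with the utterance embedding. This is a minimal, hypothetical illustration, not the paper's actual implementation; the names `Utterance`, `DenseAligner`, and `densely aligned` pooling-by-mean are assumptions introduced here for clarity.

```python
# Hypothetical sketch of densely aligned language-visual representations:
# pool the visual features of the frames overlapping each utterance, then
# fuse the pooled visual vector with that utterance's text embedding.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class Utterance:
    text_emb: torch.Tensor  # (d_text,) embedding of the spoken sentence
    start: float            # utterance start time in seconds
    end: float              # utterance end time in seconds


class DenseAligner(nn.Module):
    """Fuses each utterance with visual features synchronized to it."""

    def __init__(self, d_text: int, d_vis: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_text + d_vis, d_model)

    def forward(self, utterances, frame_feats, frame_times):
        # frame_feats: (T, d_vis) per-frame visual features (e.g., players' gestures)
        # frame_times: (T,) timestamp of each frame in seconds
        fused = []
        for utt in utterances:
            # Keep only the frames that overlap this utterance's span, so
            # non-verbal cues stay aligned with the words they accompany.
            mask = (frame_times >= utt.start) & (frame_times <= utt.end)
            vis = (frame_feats[mask].mean(dim=0) if mask.any()
                   else torch.zeros(frame_feats.size(1)))
            fused.append(self.proj(torch.cat([utt.text_emb, vis])))
        return torch.stack(fused)  # (num_utterances, d_model)
```

Under this reading, the resulting per-utterance fused sequence could then feed a standard sequence model to score candidates for tasks such as speaking target identification or mentioned player prediction; the actual architecture and fusion details are those of the paper, not this sketch.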