Establishing common ground (a shared set of beliefs and mutually recognized facts) is fundamental to collaboration, yet it remains a challenge for current AI systems, especially in multimodal, multiparty settings where collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning over propositional content and belief dynamics. We then evaluate two paradigms for modeling common ground (CG): (1) state-of-the-art large language models (LLMs), prompted to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL) that performs the same task incrementally. Results on the annotated DPIP data indicate that the task challenges modern LLMs' abilities to track both task progression and belief state.
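To make the DEL-based paradigm concrete, the sketch below illustrates the core update operation of Dynamic Epistemic Logic, the public announcement, in which a communicated proposition eliminates possible worlds incompatible with it. This is an illustrative toy, not the paper's pipeline; the block colors and propositions are hypothetical stand-ins for the kinds of updates a DPIP interaction might produce.

```python
# Illustrative sketch (not the paper's implementation): a public-announcement
# update in the style of Dynamic Epistemic Logic. Group beliefs are modeled
# as a set of possible worlds; announcing a proposition removes every world
# in which it is false, incrementally narrowing the common ground.

from itertools import product

# Hypothetical DPIP-style state space: the color of each of two slots.
# A world is a tuple (slot1_color, slot2_color).
COLORS = ["red", "blue", "green"]
worlds = set(product(COLORS, repeat=2))  # initially, anything is possible

def announce(worlds, prop):
    """Publicly announcing `prop` eliminates worlds where it is false."""
    return {w for w in worlds if prop(w)}

# One participant reveals (via speech, gesture, or action) that slot 1 is red.
worlds = announce(worlds, lambda w: w[0] == "red")

# Another reveals that the two slots hold different colors.
worlds = announce(worlds, lambda w: w[0] != w[1])

# The surviving worlds represent the group's common ground so far.
print(sorted(worlds))  # → [('red', 'blue'), ('red', 'green')]
```

Each announcement is monotone (it only shrinks the world set), which is what makes the incremental, axiomatic tracking of belief state tractable; richer DEL machinery (event models, per-agent accessibility relations) would be needed to capture the asymmetric private information that DPIP deliberately induces.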