Humans converse in free form while negotiating expressed meanings or common ground. Despite their impressive conversational abilities, large generative language models do not account for individual differences in contextual understanding within a shared, situated environment. In this work, we propose MindDial, a novel conversational framework that generates situated free-form responses to negotiate common ground. We design an explicit mind module that tracks three levels of belief: the speaker's belief, the speaker's prediction of the listener's belief, and the common belief derived from the gap between the first two. A speaking-act classification head then decides whether to continue talking, end the current turn, or take a task-related action. We augment MutualFriend, a common ground alignment dataset whose goal is to find a single mutual friend through free chat between two agents, with belief dynamics annotations. Experiments show that our model with mental state modeling resembles human responses when aligning common ground while also mimicking natural human conversation flow. The ablation study further validates that the third-level common belief aggregates information from the first- and second-order beliefs and aligns common ground more efficiently.
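To make the three-level belief idea concrete, the following is a minimal toy sketch, not the MindDial implementation. The abstract does not specify how beliefs are represented or how the common belief is computed, so everything here is an assumption for illustration: beliefs are probability maps over candidate friends, the common belief is taken as the element-wise minimum of the first two levels, and a hand-written threshold rule stands in for the learned speaking-act classification head. The names `MindState`, `choose_act`, and both thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class MindState:
    """Toy belief state over candidate mutual friends (hypothetical structure)."""
    speaker: Dict[str, float]        # level 1: the speaker's own belief
    listener_pred: Dict[str, float]  # level 2: speaker's prediction of the listener's belief

    def gap(self) -> Dict[str, float]:
        # Per-candidate disagreement between the first two belief levels.
        return {k: abs(p - self.listener_pred.get(k, 0.0))
                for k, p in self.speaker.items()}

    def common(self) -> Dict[str, float]:
        # Level 3 (assumed here): a candidate is common ground only to the
        # extent that both belief levels support it.
        return {k: min(p, self.listener_pred.get(k, 0.0))
                for k, p in self.speaker.items()}


def choose_act(state: MindState,
               select_threshold: float = 0.8,
               gap_threshold: float = 0.3) -> Tuple[str, Optional[str]]:
    """Rule-based stand-in for a learned speaking-act head (illustrative only)."""
    common = state.common()
    best = max(common, key=common.get)
    if common[best] >= select_threshold:
        # Common ground is strong enough: take the task-related action.
        return "select", best
    gap = state.gap()
    widest = max(gap, key=gap.get)
    if gap[widest] >= gap_threshold:
        # Beliefs diverge: keep talking about the most contested candidate.
        return "continue", widest
    # Nothing to resolve right now: hand the turn to the other agent.
    return "end_turn", None


if __name__ == "__main__":
    aligned = MindState({"Alice": 0.9, "Bob": 0.1}, {"Alice": 0.85, "Bob": 0.15})
    print(choose_act(aligned))
```

In this toy version the "gap" drives what to talk about next, while the "common" level alone decides when to act, which mirrors the abstract's claim that the third-level belief aggregates the first two.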