Referential ambiguities arise in dialogue when a referring expression does not uniquely identify the intended referent for the addressee. Addressees usually detect such ambiguities immediately and work with the speaker to repair them using meta-communicative Clarificational Exchanges (CEs): a Clarification Request (CR) and a response. Here, we argue that the ability to generate and respond to CRs imposes specific constraints on the architecture and objective functions of multi-modal, visually grounded dialogue models. We use the SIMMC 2.0 dataset to evaluate the ability of different state-of-the-art model architectures to process CEs, with a metric that probes the contextual updates that CEs induce in the model. We find that language-based models are able to encode simple multi-modal semantic information and process some CEs, excelling at those related to the dialogue history, whilst multi-modal models can use additional learning objectives to obtain disentangled object representations, which become crucial for handling complex referential ambiguities across modalities.