In dialogue, the addressee may initially misunderstand the speaker and respond erroneously, often prompting the speaker to correct the misunderstanding in the next turn with a Third Position Repair (TPR). The ability to process and respond appropriately to such repair sequences is thus crucial for conversational AI systems. In this paper, we first collect, analyse, and publicly release BlockWorld-Repairs: a dataset of multi-modal TPR sequences in an instruction-following manipulation task that is, by design, rife with referential ambiguity. We employ this dataset to evaluate several state-of-the-art Vision and Language Models (VLMs) across multiple settings, focusing on their capability to process and accurately respond to TPRs and thus recover from miscommunication. We find that, compared to humans, all models significantly underperform on this task. We then show that VLMs can benefit from specialised losses targeting relevant tokens during fine-tuning, achieving better performance and generalising better to new scenarios. Our results suggest that these models are not yet ready to be deployed in multi-modal collaborative settings where repairs are common, and highlight the need to design training regimes and objectives that facilitate learning from interaction. Our code and data are available at www.github.com/JChiyah/blockworld-repairs
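The "specialised losses targeting relevant tokens" mentioned above can be realised in several ways; one minimal sketch is a per-token weighted cross-entropy that up-weights tokens the repair is about (e.g. the corrected referring expression). The function name and weighting scheme below are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def weighted_token_loss(logits: torch.Tensor,
                        targets: torch.Tensor,
                        token_weights: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over a sequence where each target token carries its
    own weight, so repair-relevant tokens can dominate the loss.

    logits:        (batch, seq_len, vocab_size) unnormalised scores
    targets:       (batch, seq_len) token ids
    token_weights: (batch, seq_len) non-negative weights, e.g. 1.0 for
                   ordinary tokens and >1.0 for tokens inside the
                   corrected referring expression (an assumed scheme)
    """
    # F.cross_entropy expects the class dimension second: (batch, vocab, seq_len)
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none"
    )  # -> (batch, seq_len)
    # Weighted mean: normalise by the total weight, not the token count
    return (per_token * token_weights).sum() / token_weights.sum()
```

A plausible usage during fine-tuning would build `token_weights` from a mask over the tokens of the repaired reference, leaving the rest at weight 1. Setting all weights to 1 recovers the standard mean cross-entropy, which makes the weighted variant easy to A/B against a vanilla fine-tuning baseline.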