In dialogue, the addressee may initially misunderstand the speaker and respond erroneously, often prompting the speaker to correct the misunderstanding in the next turn with a Third Position Repair (TPR). The ability to process and respond appropriately to such repair sequences is thus crucial for conversational AI systems. In this paper, we first collect, analyse, and publicly release BlockWorld-Repairs: a dataset of multi-modal TPR sequences in an instruction-following manipulation task that is, by design, rife with referential ambiguity. We employ this dataset to evaluate several state-of-the-art Vision and Language Models (VLMs) across multiple settings, focusing on their capability to process and accurately respond to TPRs and thus recover from miscommunication. We find that, compared to humans, all models significantly underperform on this task. We then show that VLMs can benefit from specialised losses targeting relevant tokens during fine-tuning, achieving better performance and generalising better to new scenarios. Our results suggest that these models are not yet ready to be deployed in multi-modal collaborative settings where repairs are common, and highlight the need to design training regimes and objectives that facilitate learning from interaction. Our code and data are available at www.github.com/JChiyah/blockworld-repairs
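The "specialised losses targeting relevant tokens" mentioned above can be realised in several ways; one minimal sketch is a per-token weighted cross-entropy that up-weights tokens the repair is about (e.g. the corrected referring expression). The function name and weighting scheme below are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def weighted_token_loss(logits: torch.Tensor,
                        targets: torch.Tensor,
                        token_weights: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over a sequence where each target token carries its
    own weight, so repair-relevant tokens can dominate the loss.

    logits:        (batch, seq_len, vocab_size) unnormalised scores
    targets:       (batch, seq_len) token ids
    token_weights: (batch, seq_len) non-negative weights, e.g. 1.0 for
                   ordinary tokens and >1.0 for tokens inside the
                   corrected referring expression (an assumed scheme)
    """
    # F.cross_entropy expects the class dimension second: (batch, vocab, seq_len)
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none"
    )  # -> (batch, seq_len)
    # Weighted mean: normalise by the total weight, not the token count
    return (per_token * token_weights).sum() / token_weights.sum()
```

A plausible usage during fine-tuning would build `token_weights` from a mask over the tokens of the repaired reference, leaving the rest at weight 1. Setting all weights to 1 recovers the standard mean cross-entropy, which makes the weighted variant easy to A/B against a vanilla fine-tuning baseline.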