For generative AI agents to partner effectively with human users, they must be able to accurately predict human intent. Yet their capacity to collaborate remains limited by a critical deficit: an inability to model common ground. We present a referential communication experiment with a factorial design in which director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) interact over multiple turns across repeated rounds to match pictures of objects that have no obvious lexicalized labels. We show that large vision-language models (LVLMs) cannot interactively generate and resolve referring expressions in a way that enables smooth communication, a crucial skill that underlies human language use. We release our corpus of 356 dialogues (89 pairs over 4 rounds each), along with the online pipeline for data collection and the tools for analyzing accuracy, efficiency, and lexical overlap.