For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is essential. Yet their capacity to collaborate remains limited by a key deficit: an inability to model common ground. Here, we present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact over multiple turns across repeated rounds to match pictures of objects that lack obvious lexicalized labels. We release the online pipeline for data collection; the tools and analyses for accuracy, efficiency, and lexical overlap; and a corpus of 356 dialogues (89 pairs over 4 rounds each) that exposes LVLMs' limitations in interactively resolving referring expressions, a crucial skill underlying human language use.
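To make the lexical-overlap analysis concrete, here is a minimal sketch (not the authors' released tooling) of one plausible measure: Jaccard overlap between a director's word sets in two rounds, which would be expected to rise as a pair converges on shared labels. The tokenizer and example utterances are illustrative assumptions.

```python
import re

def tokenize(utterance: str) -> set[str]:
    """Lowercase an utterance and split it into a set of word tokens."""
    return set(re.findall(r"[a-z']+", utterance.lower()))

def lexical_overlap(round_a: list[str], round_b: list[str]) -> float:
    """Jaccard overlap between the pooled word sets of two rounds of utterances."""
    a = set().union(*map(tokenize, round_a)) if round_a else set()
    b = set().union(*map(tokenize, round_b)) if round_b else set()
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical example: a director re-describing the same hard-to-name object
round1 = ["looks like a person kneeling with arms raised"]
round2 = ["the kneeling person with raised arms"]
print(lexical_overlap(round1, round2))  # 5 shared / 9 total word types
```

A set-based Jaccard score ignores word order and frequency, which suits a convergence analysis where the question is simply whether partners reuse the same referring vocabulary across rounds.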