We introduce a task and dataset for referring expression generation and comprehension in multi-agent embodied environments. In this task, two agents in a shared scene must take into account one another's visual perspective, which may differ from their own, to both produce and understand references to objects in the scene and the spatial relations between them. We collect a dataset of 2,970 human-written referring expressions, each paired with human comprehension judgments, and evaluate the performance of automated models as speakers and listeners paired with human partners, finding that model performance in both reference generation and comprehension lags behind that of pairs of human agents. Finally, we experiment with training an open-weight speaker model using evidence of communicative success when paired with a listener, improving communicative success from 58.9% to 69.3% and even outperforming the strongest proprietary model.