PhotoBook is a collaborative dialogue game in which two players receive private, partially overlapping sets of images and must determine which images they have in common. It challenges machines to learn how people build common ground around multimodal context to communicate effectively. Methods developed in the literature, however, cannot be deployed in real gameplay: they tackle only some subtasks of the game, and they require additional reference-chain inputs, whose extraction process is imperfect. We therefore propose a reference chain-free listener model that directly addresses the game's predictive task, i.e., deciding whether an image is shared with the partner. Our DeBERTa-based listener model reads the full dialogue and uses CLIPScore features to assess utterance-image relevance. We achieve >77% accuracy on unseen sets of images/game themes, outperforming the baseline by >17 points.
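As an illustration of the utterance-image relevance signal mentioned above, the following is a minimal sketch of the standard CLIPScore formula, w · max(cos(c, v), 0) with w = 2.5, applied to precomputed embeddings. The abstract does not say how these features are fed to the listener model, so the function names, the toy embeddings, and the matrix layout here are assumptions for illustration only; in practice the embeddings would come from a CLIP text and image encoder.

```python
import numpy as np

# Rescaling constant used in the standard CLIPScore definition.
CLIPSCORE_WEIGHT = 2.5


def _l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row of x to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def clipscore(text_emb, image_emb, w: float = CLIPSCORE_WEIGHT) -> float:
    """CLIPScore for one (utterance, image) pair of embedding vectors."""
    t = _l2_normalize(np.asarray(text_emb, dtype=float))
    v = _l2_normalize(np.asarray(image_emb, dtype=float))
    # Cosine similarity, clipped at zero, then rescaled by w.
    return float(w * max(float(t @ v), 0.0))


def relevance_matrix(utterance_embs, image_embs, w: float = CLIPSCORE_WEIGHT) -> np.ndarray:
    """Hypothetical helper: CLIPScore of every utterance against every image.

    Returns an array of shape (num_utterances, num_images).
    """
    t = _l2_normalize(np.asarray(utterance_embs, dtype=float))
    v = _l2_normalize(np.asarray(image_embs, dtype=float))
    return w * np.clip(t @ v.T, 0.0, None)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real CLIP outputs (assumption).
    utterances = [[1.0, 0.0], [0.0, 1.0]]
    images = [[1.0, 0.0], [-1.0, 0.0], [1.0, 1.0]]
    print(relevance_matrix(utterances, images))
```

A per-utterance row of such a matrix could serve as one possible relevance feature for each candidate image; whether the authors use exactly this layout is not stated in the abstract.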