Vision-language models (VLMs) have been shown to be effective at image retrieval from simple text queries, but text-image retrieval from conversational input remains a challenge. Consequently, if we want to use VLMs for reference resolution in visually-grounded dialogue, the discourse processing capabilities of these models need to be augmented. To address this issue, we propose fine-tuning a causal large language model (LLM) to generate definite descriptions that summarize the coreferential information found in the linguistic context of references. We then use a pretrained VLM to identify referents from the generated descriptions in a zero-shot manner. We evaluate our approach on a manually annotated dataset of visually-grounded dialogues and achieve results that, on average, exceed the performance of the baselines we compare against. Furthermore, we find that referent descriptions based on larger context windows have the potential to yield further performance gains.
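The two-stage approach described in the abstract (generate a definite description from the dialogue context, then retrieve the best-matching image) can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in rather than the authors' implementation: `toy_describe` takes the place of the fine-tuned LLM, `toy_embed_text` and the hand-written image vectors take the place of a pretrained VLM's text and image encoders, and cosine similarity is assumed as the matching score.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def resolve_reference(
    dialogue_context: List[str],
    mention: str,
    describe: Callable[[List[str], str], str],
    embed_text: Callable[[str], Vector],
    image_embeddings: Dict[str, Vector],
) -> str:
    """Return the id of the candidate image that best matches the mention.

    Stage 1: condense the dialogue context into a definite description.
    Stage 2: score every candidate image against that description and
             pick the highest-scoring one (no task-specific VLM training).
    """
    description = describe(dialogue_context, mention)
    query = embed_text(description)
    return max(
        image_embeddings,
        key=lambda image_id: cosine(query, image_embeddings[image_id]),
    )


# --- Hypothetical stand-ins, for illustration only -------------------------

VOCAB = ["dog", "red", "hat", "cat"]


def toy_describe(dialogue_context: List[str], mention: str) -> str:
    # Assumption: a real system would call the fine-tuned causal LLM here.
    return "the dog with the red hat"


def toy_embed_text(text: str) -> Vector:
    # Assumption: a real system would call the VLM's text encoder here.
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


# Assumption: hand-written vectors in place of VLM image embeddings.
toy_image_embeddings: Dict[str, Vector] = {
    "img_a": [1.0, 1.0, 1.0, 0.0],  # dog, red, hat
    "img_b": [0.0, 0.0, 0.0, 1.0],  # cat
    "img_c": [1.0, 0.0, 0.0, 0.0],  # dog only
}

context = [
    "A: I have one with a dog wearing a red hat.",
    "B: Is it sitting down?",
]
best = resolve_reference(
    context, "it", toy_describe, toy_embed_text, toy_image_embeddings
)
```

In this toy setup `best` is `"img_a"`, since only that vector matches all three content words of the description. The sketch is meant to show why the description step matters: the pronoun "it" on its own carries no retrievable content, whereas the generated description gives the retrieval stage a self-contained query.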