We investigate knowledge retrieval with multimodal queries, i.e., queries whose information is split across image and text inputs, a challenging task that differs from previous work on cross-modal retrieval. We curate a new dataset, ReMuQ, for benchmarking progress on this task. ReMuQ requires a system to retrieve knowledge from a large corpus by integrating content from both the text and the image of a query. We introduce a retriever model, "ReViz", that directly processes input text and images to retrieve relevant knowledge end to end, without depending on intermediate modules such as object detectors or caption generators. We also introduce a new pretraining task that is effective for learning knowledge retrieval with multimodal queries and that improves performance on downstream tasks. We demonstrate superior retrieval performance on two datasets (ReMuQ and OK-VQA) under zero-shot settings, as well as further improvements when the model is finetuned on these datasets.
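To make the task setting concrete, the following is a minimal sketch of retrieval with a multimodal query, assuming a generic dense-retrieval setup. It is not the ReViz architecture, which the abstract does not specify: the fusion by element-wise sum, the function names (`fuse_query`, `retrieve`), and the toy three-dimensional embeddings are all hypothetical and chosen only for illustration. The sketch shows why both modalities matter: the passage that matches the image and the text together outranks passages that match only one of them.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_query(image_vec, text_vec):
    """Hypothetical fusion: element-wise sum of the two modality embeddings.

    A real end-to-end retriever would learn this fusion; the sum is an
    assumption made only to keep the example small.
    """
    if len(image_vec) != len(text_vec):
        raise ValueError("image and text embeddings must have the same size")
    return l2_normalize([i + t for i, t in zip(image_vec, text_vec)])


def retrieve(query_vec, corpus, k=2):
    """Rank (passage_id, embedding) pairs by cosine similarity to the query."""
    scored = []
    for passage_id, emb in corpus:
        unit = l2_normalize(emb)
        score = sum(q * d for q, d in zip(query_vec, unit))
        scored.append((passage_id, score))
    # Python's sort is stable, so ties keep their original corpus order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (made up): dimension 0 stands for what the image shows,
# dimension 1 for what the text asks, dimension 2 for unrelated content.
image_vec = [1.0, 0.0, 0.0]
text_vec = [0.0, 1.0, 0.0]

corpus = [
    ("p_image_only", [1.0, 0.0, 0.0]),
    ("p_text_only", [0.0, 1.0, 0.0]),
    ("p_both", [1.0, 1.0, 0.0]),
    ("p_unrelated", [0.0, 0.0, 1.0]),
]

top = retrieve(fuse_query(image_vec, text_vec), corpus, k=2)
print(top)
```

In this toy corpus the top-ranked passage is `p_both`, since it is the only one aligned with the fused query in both dimensions. Querying with `image_vec` or `text_vec` alone would instead rank a single-modality passage first, which is the failure mode that a benchmark with information split across modalities is meant to expose.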