This paper studies a category of visual question answering tasks in which accessing external knowledge is necessary to answer the questions. This category is called outside-knowledge visual question answering (OK-VQA). A major step in developing OK-VQA systems is retrieving relevant documents for a given multi-modal query. The current state-of-the-art asymmetric dense retrieval model for this task uses an architecture with a multi-modal query encoder and a uni-modal document encoder. Such an architecture requires a large amount of training data to perform effectively. We propose an automatic data generation pipeline for pre-training passage retrieval models for OK-VQA tasks. The proposed approach yields a 26.9% improvement in Precision@5 over the current state-of-the-art asymmetric architecture. Additionally, the proposed pre-training approach performs well in zero-shot retrieval scenarios.
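For readers unfamiliar with the metric reported above, Precision@5 is the fraction of the top five retrieved passages that are relevant to the query. The following is a minimal sketch of that general definition, not the evaluation code used in this work; the function name `precision_at_k` and the passage identifiers are hypothetical and chosen only for illustration.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved passages that are relevant.

    ranked_ids: passage ids ordered by retrieval score, best first (hypothetical ids).
    relevant_ids: ids of passages judged relevant for the query.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_ids[:k]
    hits = sum(1 for passage_id in top_k if passage_id in relevant_ids)
    # Dividing by k (not by len(top_k)) penalizes rankings shorter than k.
    return hits / k


# Illustrative example: two of the top five passages are relevant.
ranking = ["p1", "p2", "p3", "p4", "p5", "p6"]
relevant = {"p2", "p5", "p9"}
score = precision_at_k(ranking, relevant, k=5)
```

In practice this per-query score would be averaged over all queries in the evaluation set.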