Knowledge-based Visual Question Answering about Named Entities is a challenging task that requires retrieving information from a multimodal Knowledge Base. Named entities have diverse visual representations and are therefore difficult to recognize. We argue that cross-modal retrieval may help bridge the semantic gap between an entity and its depictions and, above all, that it is complementary to mono-modal retrieval. We provide empirical evidence through experiments with a multimodal dual encoder, namely CLIP, on the recent ViQuAE, InfoSeek, and Encyclopedic-VQA datasets. Additionally, we study three different strategies to fine-tune such a model: mono-modal, cross-modal, or joint training. Our method, which combines mono- and cross-modal retrieval, is competitive with billion-parameter models on the three datasets, while being conceptually simpler and computationally cheaper.
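To make the idea of combining mono- and cross-modal retrieval concrete, here is a minimal sketch. It assumes that each Knowledge Base entity comes with one reference-image embedding and one text (name) embedding produced by a dual encoder such as CLIP, and that the two similarity scores are fused by a weighted sum. The function names, the `alpha` weight, and the fusion rule are illustrative assumptions, not details taken from the abstract, and toy vectors stand in for real CLIP embeddings.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def fused_scores(query_img, kb_img, kb_txt, alpha=0.5):
    """Score every KB entity against a query image embedding.

    mono  : image-to-image similarity (mono-modal retrieval)
    cross : image-to-text similarity (cross-modal retrieval)
    alpha : hypothetical fusion weight; the weighted sum is an assumption.
    """
    q = l2_normalize(query_img)
    mono = l2_normalize(kb_img) @ q
    cross = l2_normalize(kb_txt) @ q
    return alpha * mono + (1.0 - alpha) * cross


# Toy 2-d embeddings standing in for dual-encoder outputs.
query = np.array([1.0, 0.0])
kb_img = np.array([[0.0, 1.0], [1.0, 0.0]])  # reference images of entities 0 and 1
kb_txt = np.array([[1.0, 0.0], [1.0, 0.0]])  # text (name) embeddings of entities 0 and 1

scores = fused_scores(query, kb_img, kb_txt)
ranking = np.argsort(-scores)  # entity indices, best match first
```

In this toy case the text embeddings alone cannot separate the two entities, while the image embeddings can, which is the kind of complementarity the abstract argues for. In practice the embeddings would come from the image and text encoders of the fine-tuned model.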