Pre-trained vision and language models have demonstrated state-of-the-art capabilities on existing tasks involving images and text, including visual question answering. However, it remains unclear whether these models can answer questions that go beyond querying visual content and are knowledge-intensive and information-seeking. In this study, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with common sense knowledge alone. Using InfoSeek, we analyze various pre-trained visual question answering models and gain insights into their characteristics. Our findings reveal that state-of-the-art pre-trained multi-modal models (e.g., PaLI-X and BLIP2) face challenges in answering visual information-seeking questions, but fine-tuning on the InfoSeek dataset elicits the use of fine-grained knowledge that the models learned during pre-training. Furthermore, we show that accurate visual entity recognition can improve performance on InfoSeek by retrieving relevant documents, indicating substantial room for improvement.