Large language models have demonstrated an emergent capability to answer knowledge-intensive questions. With recent progress in web-scale visual and language pre-training, do these models also know how to answer visual information-seeking questions? To investigate, we present InfoSeek, a Visual Question Answering dataset focused on information-seeking questions that cannot be answered with common sense knowledge. We perform multi-stage human annotation to collect a natural distribution of high-quality visual information-seeking question-answer pairs. We also construct a large-scale, automatically collected dataset by combining existing visual entity recognition datasets with Wikidata, which provides over one million examples for model fine-tuning and validation. Using InfoSeek, we analyze various pre-trained Visual QA systems to gain insight into the characteristics of different pre-trained models. Our analysis shows that state-of-the-art multi-modal pre-trained models find it challenging to answer visual information-seeking questions, but that this capability improves with fine-tuning on the automated InfoSeek dataset. We hope our analysis paves the way toward understanding and developing the next generation of multi-modal pre-training.