Large vision-language models (LVLMs), such as the LLaVA series, lack up-to-date knowledge because the resources required to update them are too large for frequent updates, and they therefore fail in many cases. For example, an LVLM released in January 2024 would not know the detailed plot of the movie Dune 2, which was not released until February 2024. A promising solution is to provide LVLMs with up-to-date knowledge via internet search during inference, i.e., internet-augmented generation (IAG), which is already integrated into some closed-source commercial LVLMs such as GPT-4V. However, the specific mechanisms underpinning these systems have not been disclosed. In this paper, we propose UDKAG, a plug-and-play framework that augments existing LVLMs to handle visual question answering (VQA) about up-to-date knowledge. A hierarchical filtering model is trained to effectively and efficiently find the most helpful content among the websites returned by a search engine, and this content is used to prompt LVLMs with up-to-date knowledge. To train the model and evaluate our framework's performance, we propose a pipeline that automatically generates news-related VQA samples, from which we construct a dataset dubbed UDK-VQA. To construct the training set, a multi-model voting mechanism is introduced to label the usefulness of websites and content for VQA samples. Experimental results demonstrate the effectiveness of our framework, which outperforms GPT-4V by about 25% in accuracy.
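To make the data flow described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a two-stage (website-level, then passage-level) reading of "hierarchical filtering" and a simple majority rule for the multi-model voting; the abstract specifies neither. All names (`Website`, `overlap_score`, `hierarchical_filter`, `build_prompt`, `majority_vote_label`) are hypothetical, and the token-overlap scorer is only a stand-in for the trained filtering model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Website:
    """One search result (hypothetical structure, not from the paper)."""
    title: str
    snippet: str
    segments: List[str]  # page text already split into passages


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query tokens that occur in the text.

    Stand-in for the trained filtering model, which the abstract does not specify.
    """
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & text_tokens) / len(query_tokens)


def hierarchical_filter(
    query: str,
    websites: Sequence[Website],
    score: Callable[[str, str], float] = overlap_score,
    top_sites: int = 2,
    top_segments: int = 2,
) -> List[str]:
    """Assumed two-stage filter: keep the best websites, then the best passages."""
    # Stage 1 (coarse): rank websites using only their title and snippet.
    kept_sites = sorted(
        websites,
        key=lambda w: score(query, w.title + " " + w.snippet),
        reverse=True,
    )[:top_sites]
    # Stage 2 (fine): rank the passages of the surviving websites.
    candidates = [seg for site in kept_sites for seg in site.segments]
    return sorted(candidates, key=lambda s: score(query, s), reverse=True)[:top_segments]


def build_prompt(question: str, passages: Sequence[str]) -> str:
    """Prepend the retrieved passages to the question as textual context."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Retrieved context:\n{context}\n\nQuestion: {question}\nAnswer:"


def majority_vote_label(votes: Sequence[bool]) -> bool:
    """Label content as useful only if a strict majority of models voted 'useful'."""
    return sum(votes) * 2 > len(votes)
```

In a real system the scorer would be a learned model and the websites would come from a search engine at inference time; the sketch only shows how coarse-to-fine filtering limits the amount of text that reaches the LVLM prompt.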