Visual question answering (VQA) is the task of answering questions about an image. It requires understanding both the image and the question in order to produce a natural-language answer. VQA has gained popularity in recent years due to its potential applications in a wide range of fields, including robotics, education, and healthcare. In this paper, we focus on knowledge-augmented VQA, where answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image. We propose a multimodal framework that uses language guidance (LG), in forms such as rationales, image captions, and scene graphs, to answer questions more accurately. We benchmark our method on the multi-choice question-answering task of the A-OKVQA, Science-QA, VSR, and IconQA datasets using CLIP and BLIP models. We show that language guidance is a simple yet effective strategy for visual question answering. Our language guidance improves the performance of CLIP by 7.6% and BLIP-2 by 4.8% on the challenging A-OKVQA dataset. We also observe consistent performance improvements on the Science-QA, VSR, and IconQA datasets when using the proposed language guidance. The implementation of LG-VQA is publicly available at https://github.com/declare-lab/LG-VQA.
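To make the idea of language guidance in multi-choice VQA concrete, the following is a minimal sketch, not the LG-VQA implementation. It assumes a simple additive fusion of an image embedding with a text embedding of the question plus guidance, and scores each answer choice by cosine similarity. The function `answer_with_guidance`, the `toy_embed` keyword-count encoder, and the three-word vocabulary are all hypothetical stand-ins for real CLIP or BLIP encoders.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_with_guidance(
    image_vec: Sequence[float],
    question: str,
    choices: Sequence[str],
    embed_text: Callable[[str], Vector],
    guidance: Sequence[str] = (),
) -> int:
    """Return the index of the best-scoring choice.

    Assumption: image and text are fused by element-wise addition.
    This is an illustrative choice, not the architecture of the paper.
    """
    # Guidance (rationale, caption, scene graph text) is appended to the question.
    context_vec = embed_text(" ".join([question, *guidance]))
    fused = [i + t for i, t in zip(image_vec, context_vec)]
    scores = [cosine(fused, embed_text(choice)) for choice in choices]
    return max(range(len(choices)), key=lambda k: scores[k])


# Hypothetical toy encoder: counts keyword occurrences instead of using a real model.
VOCAB = ("rain", "sun", "umbrella")


def toy_embed(text: str) -> Vector:
    return [float(text.lower().count(word)) for word in VOCAB]


# Toy image embedding: an umbrella is clearly visible, the weather is ambiguous.
image_vec = [0.0, 0.3, 1.0]
question = "What is the weather like?"
choices = ["sunny", "rainy"]
guidance = ["The people hold an umbrella because of the rain."]

without_lg = answer_with_guidance(image_vec, question, choices, toy_embed)
with_lg = answer_with_guidance(image_vec, question, choices, toy_embed, guidance)
```

In this toy setup the image alone leans slightly toward "sunny" (`without_lg` is 0), while adding the rationale shifts the fused representation toward "rainy" (`with_lg` is 1). With real encoders the guidance text would come from a captioner, a scene-graph generator, or a rationale model, and the fusion would typically be learned rather than a fixed sum.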