Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge

With the breakthrough of multi-modal large language models, answering complex visual questions that demand advanced reasoning abilities and world knowledge has become a more important testbed than ever for developing AI models. However, equipping AI models with robust cross-modality reasoning ability remains challenging, since the human cognition scheme has not yet been understood systematically. In this paper, we argue that if we can collect as many visual clues as possible from the given image, we will recognize the image more accurately, understand the question better, recall relevant knowledge more easily, and finally reason out the answer. We discover these rich visual clues by mining question-answer pairs from images and sending them into multi-modal large language models as prompts. We call the proposed method Q&A Prompts. Specifically, we first train a visual question generation model, using the image-answer pairs in the training set as inputs and the corresponding questions as outputs. Then, we use an image tagging model to identify various instances and send packaged image-tag pairs into the visual question generation model, which generates relevant questions with the extracted image tags as answers. Finally, we encode these generated question-answer pairs as prompts with a visual-aware prompting module and send them into pre-trained multi-modal large language models to reason out the final answers. Experimental results show that, compared with state-of-the-art methods, Q&A Prompts achieves substantial improvements on challenging visual question answering datasets that require reasoning over diverse world knowledge, such as OK-VQA and A-OKVQA.
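
The three stages described above can be sketched as a data flow. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`QAPair`, `vqg_training_examples`, `mine_qa_pairs`, `format_prompt`, and the toy tagger and question generator) is hypothetical, the two models are replaced by trivial stand-in functions, and the learned visual-aware prompting module is replaced by plain text concatenation purely to show where the mined pairs enter the prompt.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class QAPair:
    question: str
    answer: str


def vqg_training_examples(
    triples: List[Tuple[str, str, str]],
) -> List[Tuple[Tuple[str, str], str]]:
    """Stage 1: turn (image, question, answer) triples from a VQA training
    set into ((image, answer), question) examples for a question generator."""
    return [((image, answer), question) for image, question, answer in triples]


def mine_qa_pairs(
    image: str,
    tagger: Callable[[str], List[str]],
    question_generator: Callable[[str, str], str],
) -> List[QAPair]:
    """Stage 2: tag the image, then generate one question per tag,
    treating the tag as the answer."""
    return [QAPair(question_generator(image, tag), tag) for tag in tagger(image)]


def format_prompt(pairs: List[QAPair], target_question: str) -> str:
    """Stage 3 (simplified): concatenate mined pairs as text ahead of the
    target question. The abstract describes a visual-aware prompting module
    here; text concatenation is a stand-in, not that module."""
    lines = [f"Question: {p.question} Answer: {p.answer}" for p in pairs]
    lines.append(f"Question: {target_question} Answer:")
    return "\n".join(lines)


# Toy stand-ins (assumptions): a real system would call an image tagging
# model and a trained visual question generation model instead.
def toy_tagger(image: str) -> List[str]:
    return ["dog", "frisbee"]


def toy_question_generator(image: str, tag: str) -> str:
    return f"Which object in the picture could be described as '{tag}'?"


if __name__ == "__main__":
    pairs = mine_qa_pairs("park.jpg", toy_tagger, toy_question_generator)
    print(format_prompt(pairs, "What game is being played?"))
```

In this sketch the image is passed around as a file name only; a real pipeline would pass pixel data or visual features to each model.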
