Large Language Models (LLMs) demonstrate impressive reasoning ability and retention of world knowledge, not only in natural language tasks but also in some vision-language tasks such as open-domain knowledge-based visual question answering (OK-VQA). Because LLMs cannot take images as input, researchers convert images to text to engage LLMs in the visual question reasoning procedure. This conversion introduces discrepancies between images and the textual representations presented to LLMs, which in turn impede final reasoning performance. To fill this information gap and better leverage the reasoning capability of LLMs, we design a framework that enables LLMs to proactively ask relevant questions to unveil more details in the image, along with filters for refining the generated information. We validate our idea on OK-VQA and A-OKVQA. Our method consistently boosts the performance of baseline methods, with an average gain of 2.15% on OK-VQA, and the improvements hold across different LLMs.
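The ask-then-filter loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify how follow-up questions are answered or what the filters check, so `ask_questions`, `answer_from_image`, and `keep` are hypothetical callables standing in for an LLM, an image-grounded answering model, and a filter, respectively. The function names and the prompt layout are likewise assumptions.

```python
from typing import Callable, List, Tuple

Detail = Tuple[str, str]  # (follow-up question, answer)


def gather_visual_details(
    caption: str,
    question: str,
    ask_questions: Callable[[str, str], List[str]],  # hypothetical: LLM proposes follow-ups
    answer_from_image: Callable[[str], str],         # hypothetical: answers using the image
    keep: Callable[[str, str], bool],                # hypothetical: filter on (question, answer)
) -> List[Detail]:
    """Collect follow-up question/answer pairs that pass the filter."""
    details: List[Detail] = []
    for follow_up in ask_questions(caption, question):
        answer = answer_from_image(follow_up)
        # Discard generated information that the filter rejects.
        if keep(follow_up, answer):
            details.append((follow_up, answer))
    return details


def build_prompt(caption: str, question: str, details: List[Detail]) -> str:
    """Assemble the enriched textual context for the final reasoning step."""
    lines = [f"Caption: {caption}"]
    lines += [f"Detail: {q} {a}" for q, a in details]
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)
```

Passing the three components as callables keeps the sketch independent of any particular LLM or vision model; in practice each would wrap a model call.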