Visual question answering (VQA) is a challenging task that requires comprehending and reasoning over visual information. Although recent vision-language models have made progress, they still struggle with zero-shot VQA, particularly on complex compositional questions and when adapting to new domains such as knowledge-based reasoning. This paper explores prompting strategies for improving zero-shot VQA performance, focusing on the BLIP2 model. We conduct a comprehensive investigation across several VQA datasets, examining the effectiveness of different question templates, the role of few-shot exemplars, the impact of chain-of-thought (CoT) reasoning, and the benefit of incorporating image captions as additional visual cues. Although outcomes vary, our findings show that carefully designed question templates and additional visual cues, such as image captions, can improve VQA performance, especially when combined with few-shot examples. However, we also find that chain-of-thought rationalization negatively affects VQA accuracy. Our study thus provides critical insights into the potential of prompting for improving zero-shot VQA performance.
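To make the prompting strategies named in the abstract concrete, the following is a minimal sketch of how a question template, an optional image caption, and optional few-shot exemplars might be assembled into a single text prompt. It is illustrative only: the function name `build_prompt`, the `Context:` prefix, and the default template string are assumptions made for this example and are not taken from the paper, and the sketch covers only string construction, not any call to BLIP2.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical default template; the templates actually studied are not listed in the abstract.
DEFAULT_TEMPLATE = "Question: {question} Short answer:"


def build_prompt(
    question: str,
    caption: Optional[str] = None,
    exemplars: Sequence[Tuple[str, str]] = (),
    template: str = DEFAULT_TEMPLATE,
) -> str:
    """Assemble a text prompt from a caption, few-shot exemplars, and a question."""
    parts = []

    # Optional image caption, used as an additional visual cue.
    if caption:
        parts.append(f"Context: {caption.strip()}")

    # Few-shot exemplars: each is a (question, answer) pair rendered with the same template.
    for ex_question, ex_answer in exemplars:
        rendered = template.format(question=ex_question.strip())
        parts.append(f"{rendered} {ex_answer.strip()}")

    # The target question is rendered last and left unanswered for the model to complete.
    parts.append(template.format(question=question.strip()))
    return "\n".join(parts)


if __name__ == "__main__":
    print(
        build_prompt(
            "What is the man holding?",
            caption="a man standing on a beach",
            exemplars=[("What color is the sky?", "blue")],
        )
    )
```

Keeping the template as a parameter means different question phrasings can be compared without changing the rest of the pipeline, and omitting `caption` and `exemplars` gives the plain zero-shot prompt.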