In this paper, we explore effective prompting techniques to enhance zero- and few-shot Visual Question Answering (VQA) performance in contemporary Vision-Language Models (VLMs). Central to our investigation is the role of question templates in guiding VLMs to generate accurate answers. We find that the choice of template significantly influences VQA outcomes, underscoring the need for strategic template selection. Another pivotal aspect of our study is augmenting VLMs with image captions, providing them with additional visual cues alongside direct image features in VQA tasks. Surprisingly, this augmentation significantly improves the VLMs' performance in many cases, even though VLMs "see" the image directly! We also explore chain-of-thought (CoT) reasoning and find that while standard CoT reasoning causes drops in performance, advanced methods like self-consistency can help recover it. Furthermore, we find that text-only few-shot examples enhance VLMs' alignment with the task format, particularly benefiting models prone to verbose zero-shot answers. Lastly, to mitigate the challenges of evaluating free-form, open-ended VQA responses with string-matching-based VQA metrics, we introduce a straightforward LLM-guided pre-processing technique that adapts model responses to the expected ground-truth answer distribution. In summary, our research sheds light on the intricacies of prompting strategies in VLMs for VQA, emphasizing the synergistic use of captions, templates, and pre-processing to enhance model efficacy.
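The self-consistency idea mentioned in the abstract can be illustrated with a minimal sketch: sample several answers to the same question and keep the most frequent one. This is an illustration under stated assumptions, not the authors' implementation. The function names `normalize` and `self_consistent_answer` are hypothetical, the sampled answers are supplied by the caller rather than produced by a real VLM, and the light normalization (lowercasing, trimming a trailing period) is an assumption rather than the paper's pre-processing step.

```python
from collections import Counter


def normalize(answer: str) -> str:
    # Assumed normalization: trim whitespace, lowercase, drop trailing periods.
    return answer.strip().lower().rstrip(".")


def self_consistent_answer(samples: list[str]) -> str:
    # Majority vote over normalized answers; blank samples are ignored.
    votes = Counter(normalize(s) for s in samples if s.strip())
    if not votes:
        return ""
    # most_common(1) returns the answer with the highest count;
    # ties resolve to the answer that was seen first.
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical answers sampled from several CoT runs on one question.
    print(self_consistent_answer(["Two", "two.", "three"]))
```

In practice each sample would come from a separate chain-of-thought generation at non-zero temperature, with the final short answer extracted from each reasoning trace before voting.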