The zero-shot performance of visual question answering (VQA) models depends heavily on the prompt. For example, a zero-shot VQA model for disaster scenarios can use well-designed Chain of Thought (CoT) prompts to draw out the model's capabilities. However, CoT prompting has drawbacks: a hallucination in the thought process can lead to an incorrect final answer. In this paper, we propose a zero-shot VQA model named Flood Disaster VQA with Two-Stage Prompt (VQA-TSP). The model generates the thought process in the first stage and then uses that thought process to generate the final answer in the second stage. In particular, visual context is added in the second stage to mitigate hallucinations in the thought process. Experimental results show that our method outperforms state-of-the-art zero-shot VQA models overall in flood disaster scenarios. Our study provides a basis for further research on improving CoT-based zero-shot VQA.
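The two-stage flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `VLM` callable interface, the helper names, and the prompt wording are all hypothetical, and the abstract does not specify what form the second-stage visual context takes. Here it is assumed to be the image itself, supplied again alongside the generated thought process.

```python
from typing import Callable, Optional

# Hypothetical model interface (an assumption, not from the paper):
# takes a text prompt and an optional image, returns generated text.
VLM = Callable[[str, Optional[bytes]], str]


def build_thought_prompt(question: str) -> str:
    # Stage 1 prompt: ask for a step-by-step thought process (CoT).
    # The wording is illustrative only.
    return f"Question: {question}\nLet's think step by step."


def build_answer_prompt(question: str, thought: str) -> str:
    # Stage 2 prompt: feed the stage 1 thought process back in and ask
    # for a final answer that is checked against the visual input.
    return (
        f"Question: {question}\n"
        f"Reasoning: {thought}\n"
        "Check the reasoning against the image, then give a short final answer."
    )


def two_stage_answer(model: VLM, image: bytes, question: str) -> str:
    # Stage 1: generate the thought process.
    thought = model(build_thought_prompt(question), image)
    # Stage 2: generate the final answer from the thought process, with the
    # image supplied again as visual context (assumed form) so that a
    # hallucinated reasoning step can be contradicted by the picture.
    return model(build_answer_prompt(question, thought), image)
```

Any vision-language model wrapped to match the `VLM` signature could be passed as `model`. Keeping the two prompts in separate functions makes it straightforward to vary the second-stage context without touching the first stage.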