In this work, we explore the synergistic capabilities of pre-trained vision-and-language models (VLMs) and large language models (LLMs) on visual commonsense reasoning (VCR) problems. We find that VLM-based and LLM-based decision pipelines are good at different kinds of VCR problems. Pre-trained VLMs perform strongly on problems that involve understanding the literal visual content, which we term visual commonsense understanding (VCU). For problems where the goal is to infer conclusions beyond the image content, which we term visual commonsense inference (VCI), VLMs face difficulties, while LLMs, given sufficient visual evidence, can use commonsense to infer the answer well. We empirically validate this by letting LLMs classify VCR problems into these two categories and showing a significant performance difference between the VLM pipeline and the LLM-with-image-caption pipeline on the two subproblems. Moreover, we identify a challenge with VLMs' passive perception, which may miss crucial context information and thereby lead to incorrect reasoning by LLMs. Based on these findings, we propose a collaborative approach, named ViCor, in which pre-trained LLMs serve as problem classifiers that analyze the problem category and then either use VLMs to answer the question directly or actively instruct VLMs to concentrate on and gather relevant visual elements to support potential commonsense inferences. We evaluate our framework on two VCR benchmark datasets and outperform all other methods that do not require in-domain fine-tuning.
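The routing logic described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `vicor_answer`, its parameters (`classify`, `vlm_score`, `vlm_gather`, `llm_choose`), and the toy stand-ins below are all hypothetical names introduced here for illustration. The abstract does not specify how the VLM answers directly or how visual evidence is gathered, so the sketch assumes a multiple-choice setting in which the VLM scores each candidate answer.

```python
from typing import Callable, List


def vicor_answer(
    question: str,
    choices: List[str],
    classify: Callable[[str], str],
    vlm_score: Callable[[str], float],
    vlm_gather: Callable[[str], str],
    llm_choose: Callable[[str, List[str], str], int],
) -> int:
    """Return the index of the selected answer choice.

    All four callables are hypothetical stand-ins for model calls:
      classify   - LLM acting as problem classifier, returns "VCU" or "VCI"
      vlm_score  - VLM score for a question-plus-answer string against the image
      vlm_gather - VLM asked to collect visual evidence relevant to the question
      llm_choose - LLM picking a choice given the gathered evidence
    """
    category = classify(question)
    if category == "VCU":
        # Understanding problem: let the VLM answer directly (assumed here
        # to mean scoring each candidate and taking the best one).
        scores = [vlm_score(f"{question} {choice}") for choice in choices]
        return max(range(len(choices)), key=scores.__getitem__)
    # Inference problem: gather visual evidence first, then let the LLM
    # apply commonsense to that evidence.
    evidence = vlm_gather(question)
    return llm_choose(question, choices, evidence)


# Toy stand-ins so the sketch runs without any model; purely illustrative.
def toy_classify(question: str) -> str:
    return "VCU" if question.lower().startswith("what is") else "VCI"


def toy_vlm_score(text: str) -> float:
    return 1.0 if "umbrella" in text else 0.0


def toy_vlm_gather(question: str) -> str:
    return "the ground is wet"


def toy_llm_choose(question: str, choices: List[str], evidence: str) -> int:
    return choices.index("It rained recently") if "wet" in evidence else 0
```

With the toy stand-ins, a question such as "What is the person holding?" is routed to the VLM branch, while "Why is the person holding an umbrella?" is routed to the evidence-gathering branch. In a real system each callable would wrap a model call, and the classifier prompt and evidence-gathering strategy would need to follow the paper's actual design.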