Pre-trained visual language models (VLMs) have shown excellent performance in image captioning tasks. However, they sometimes show insufficient reasoning ability. In contrast, large language models (LLMs) exhibit powerful reasoning capabilities. Therefore, we propose a method called TReE, which transfers the reasoning ability of a large language model to a visual language model in zero-shot scenarios. TReE contains three stages: observation, thinking, and re-thinking. In the observation stage, the VLM obtains the overall information of the relevant image. In the thinking stage, the image information and the task description are combined into the prompt of the LLM, which infers the rationales. In the re-thinking stage, the VLM learns from the rationale and then infers the final result.
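The three stages can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `tree_pipeline`, the `VLM` and `LLM` callable interfaces, the prompt wording, and the toy stand-in models are all assumptions introduced here only to show how information might flow from observation to thinking to re-thinking.

```python
from typing import Callable, Dict

# Hypothetical interfaces (assumptions, not from the paper):
# a VLM maps (image, prompt) to text; an LLM maps a prompt to text.
VLM = Callable[[object, str], str]
LLM = Callable[[str], str]


def tree_pipeline(image: object, task: str, vlm: VLM, llm: LLM) -> Dict[str, str]:
    """Run the three stages in order and return every intermediate output."""
    # Stage 1, observation: the VLM describes the image as a whole.
    observation = vlm(image, "Describe the image.")

    # Stage 2, thinking: image information and task description form the
    # LLM prompt, and the LLM produces a rationale. Prompt text is illustrative.
    thinking_prompt = f"Image information: {observation}\nTask: {task}\nRationale:"
    rationale = llm(thinking_prompt)

    # Stage 3, re-thinking: the VLM receives the rationale together with the
    # image and infers the final result.
    rethinking_prompt = f"Task: {task}\nRationale: {rationale}\nAnswer:"
    answer = vlm(image, rethinking_prompt)

    return {"observation": observation, "rationale": rationale, "answer": answer}


# Toy stand-ins so the sketch runs without any real model.
def toy_vlm(image: object, prompt: str) -> str:
    if prompt.startswith("Describe"):
        return "a dog catching a frisbee"
    return "yes"


def toy_llm(prompt: str) -> str:
    return "The description mentions a frisbee, so the dog is playing."


result = tree_pipeline(None, "Is the dog playing?", toy_vlm, toy_llm)
print(result["answer"])
```

In a real setting, `toy_vlm` and `toy_llm` would be replaced by calls to actual models. Because no parameters are updated anywhere in the sketch, the flow stays consistent with the zero-shot setting described above.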