Recent advances in Chain-of-Thought (CoT) prompting and related rationale-based methods have significantly improved the performance of Large Language Models (LLMs) on complex reasoning tasks. With the evolution of Multimodal Large Language Models (MLLMs), enhancing their ability to tackle complex multimodal reasoning problems is a crucial frontier. However, incorporating multimodal rationales into CoT has yet to be thoroughly investigated. We propose the Image-of-Thought (IoT) prompting method, which helps MLLMs extract visual rationales step by step. Specifically, IoT prompting automatically designs critical visual information extraction operations based on the input image and question. Each step of visual information refinement identifies specific visual rationales that support answers to complex visual reasoning questions. Beyond textual CoT, IoT simultaneously leverages visual and textual rationales to help MLLMs understand complex multimodal information. IoT prompting improves zero-shot visual reasoning performance across various visual understanding tasks and different MLLMs. Moreover, the step-by-step visual feature explanations generated by IoT prompting elucidate the visual reasoning process, aiding analysis of the cognitive processes of large multimodal models.
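The step-by-step extraction loop described above can be sketched as follows. This is a minimal illustrative mock-up, not the paper's actual implementation: the operation names (e.g. "crop"), the `mock_mllm` stand-in, and the data structures are all assumptions introduced for illustration; a real system would send the image and prompts to an MLLM endpoint.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    operation: str          # visual extraction operation chosen by the model
    visual_rationale: str   # evidence extracted from the image at this step
    textual_rationale: str  # textual explanation tying evidence to the question

@dataclass
class IoTTrace:
    steps: list = field(default_factory=list)
    answer: str = ""

def mock_mllm(prompt: str) -> str:
    """Hypothetical stand-in for a multimodal LLM call; a real system
    would pass the image along with the prompt to an MLLM endpoint."""
    if "choose an operation" in prompt:
        return "crop: focus on the region referenced by the question"
    if "answer the question" in prompt:
        return "final answer based on collected rationales"
    return "rationale text"

def iot_prompting(image, question: str, max_steps: int = 3) -> IoTTrace:
    """Collect paired visual and textual rationales step by step,
    then answer using the accumulated trace."""
    trace = IoTTrace()
    for _ in range(max_steps):
        # 1. The model designs a visual information extraction operation.
        op = mock_mllm(f"Given the image and question '{question}', choose an operation.")
        # 2. Apply it and record the visual rationale it surfaces.
        visual = mock_mllm(f"Apply {op} and describe the extracted visual evidence.")
        # 3. Record a textual rationale linking that evidence to the question.
        textual = mock_mllm(f"Explain how this evidence relates to '{question}'.")
        trace.steps.append(Step(op, visual, textual))
    # 4. Answer using both visual and textual rationales.
    trace.answer = mock_mllm("Using all rationales, answer the question.")
    return trace

trace = iot_prompting(image=None, question="What is the man holding?")
print(len(trace.steps), trace.answer)
```

The trace makes the reasoning process inspectable: each step pairs an operation with the visual and textual rationales it produced, which is the property the abstract highlights for analyzing the model's cognition.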