A long-standing goal of AI systems is to perform complex multimodal reasoning as humans do. Recently, large language models (LLMs) have made remarkable strides in multi-step reasoning within the language modality alone by leveraging chain of thought (CoT) to mimic human thinking. However, transferring these advances to multimodal contexts introduces heightened challenges, including but not limited to the impractical need for labor-intensive annotation and limitations in flexibility, generalizability, and explainability. To evoke CoT reasoning in multimodality, this work first conducts an in-depth analysis of the challenges posed by multimodality and presents two key insights: "keeping critical thinking" and "letting everyone do their jobs" in multimodal CoT reasoning. Furthermore, this study proposes a novel prompting method, DDCoT, that maintains a critical attitude through negative-space prompting and incorporates multimodality into reasoning by first dividing the reasoning responsibility of LLMs into reasoning and recognition and then integrating the visual recognition capability of visual models into the joint reasoning process. The rationales generated by DDCoT not only improve the reasoning abilities of both large and small language models in zero-shot prompting and fine-tuning learning, significantly outperforming state-of-the-art methods, but also exhibit impressive generalizability and explainability.
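The division of labor described above (the LLM reasons, a visual model recognizes, and the LLM then reasons jointly over both) can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the `Uncertain` marker, the `Q: ... | A: ...` line format, and the names `ddcot_sketch`, `llm`, and `vqa` are all hypothetical choices made for illustration. `llm` and `vqa` stand for any callables that map a text prompt to a text answer.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical marker the LLM is asked to emit when a sub-question
# cannot be answered without seeing the image (negative-space prompting).
UNCERTAIN = "Uncertain"


@dataclass
class SubQuestion:
    text: str
    answer: str


def decomposition_prompt(question: str) -> str:
    """Ask the LLM to split the question and to admit what it cannot see."""
    return (
        "Decompose the question into sub-questions, one per line, as "
        "'Q: <sub-question> | A: <answer>'. The image is not available to you: "
        f"answer '{UNCERTAIN}' whenever a sub-question requires looking at it.\n"
        f"Question: {question}"
    )


def parse_sub_questions(raw: str) -> List[SubQuestion]:
    """Parse 'Q: ... | A: ...' lines; lines in any other shape are ignored."""
    subs = []
    for line in raw.splitlines():
        if line.startswith("Q:") and "| A:" in line:
            q, a = line[2:].split("| A:", 1)
            subs.append(SubQuestion(q.strip(), a.strip()))
    return subs


def ddcot_sketch(
    question: str,
    llm: Callable[[str], str],
    vqa: Callable[[str], str],
) -> str:
    # Step 1: the LLM handles reasoning only and flags recognition gaps.
    subs = parse_sub_questions(llm(decomposition_prompt(question)))

    # Step 2: the visual model handles recognition for the flagged gaps.
    for sub in subs:
        if sub.answer.lower() == UNCERTAIN.lower():
            sub.answer = vqa(sub.text)

    # Step 3: joint reasoning, with an instruction to stay critical of
    # the recognized content, which may be wrong.
    facts = "\n".join(f"- {s.text} -> {s.answer}" for s in subs)
    joint_prompt = (
        "Supplementary information (may contain recognition errors; "
        "judge it critically):\n"
        f"{facts}\nQuestion: {question}\n"
        "Give a step-by-step rationale and a final answer."
    )
    return llm(joint_prompt)
```

In this sketch the visual model is queried only for sub-questions the LLM flagged as uncertain, so each component stays within its own responsibility, and the final prompt explicitly asks the LLM to treat the recognized content with caution.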