A long-standing goal of AI systems is to perform complex multimodal reasoning like humans. Recently, large language models (LLMs) have made remarkable strides in multi-step reasoning in the language modality alone by leveraging chain of thought (CoT) to mimic human thinking. However, transferring these advances to multimodal contexts introduces heightened challenges, including but not limited to the impractical need for labor-intensive annotation and limitations in flexibility, generalizability, and explainability. To evoke CoT reasoning in multimodality, this work first conducts an in-depth analysis of these challenges and presents two key insights: "keeping critical thinking" and "letting everyone do their jobs" in multimodal CoT reasoning. Furthermore, this study proposes a novel prompting method, DDCoT, that maintains a critical attitude through negative-space prompting and incorporates multimodality into reasoning by first dividing the reasoning responsibility of LLMs into reasoning and recognition and then integrating the visual recognition capability of visual models into the joint reasoning process. The rationales generated by DDCoT not only improve the reasoning abilities of both large and small language models in zero-shot prompting and fine-tuning, significantly outperforming state-of-the-art methods, but also exhibit impressive generalizability and explainability.