Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation

Kohei Uehara, Nabarun Goswami, Hanqin Wang, Toshiaki Baba, Kohtaro Tanaka, Tomohiro Hashimoto, Kai Wang, Rei Ito, Takagi Naoya, Ryo Umagami, Yingyi Wen, Tanachai Anakewat, Tatsuya Harada

The increasing demand for intelligent systems capable of interpreting and reasoning about visual content requires the development of Large Multi-Modal Models (LMMs) that are not only accurate but also capable of explicit reasoning. This paper presents a novel approach to imbue an LMM with the ability to conduct explicit reasoning based on visual content and textual instructions. We introduce a system that can ask a question to acquire necessary knowledge, thereby enhancing the robustness and explicability of the reasoning process. Our method includes a novel dataset, generated by a Large Language Model (LLM), that is designed to promote chain-of-thought reasoning combined with a question-asking mechanism. We also design an LMM with strong region-awareness capabilities to address the intricate requirements of image-text alignment. The model is trained in three stages: large-scale image-text alignment on a large-scale dataset, followed by instruction tuning, and finally fine-tuning focused on chain-of-thought reasoning. The results demonstrate a step toward a more robust, accurate, and interpretable LMM, capable of reasoning explicitly and seeking information proactively when confronted with ambiguous visual input.

