The growing demand for intelligent systems that interpret and reason about visual content calls for Large Multi-Modal Models (LMMs) that are not only accurate but also capable of explicit reasoning. This paper presents a novel approach that equips an LMM with the ability to reason explicitly over visual content and textual instructions. We introduce a system that can ask a question to acquire the knowledge it needs, thereby improving the robustness and explainability of the reasoning process. Our method includes a new dataset, generated by a Large Language Model (LLM), that is designed to promote chain-of-thought reasoning combined with a question-asking mechanism. We design an LMM with strong region-awareness capabilities to address the intricate requirements of image-text alignment. The model is trained in three stages: image-text alignment on large-scale datasets, followed by instruction tuning, and finally fine-tuning focused on chain-of-thought reasoning. The results demonstrate a step toward a more robust, accurate, and interpretable LMM that reasons explicitly and proactively seeks information when confronted with ambiguous visual input.