Large Language Models (LLMs) demonstrate enhanced capabilities and reliability by reasoning more, evolving from Chain-of-Thought prompting to product-level solutions like OpenAI o1. Despite various efforts to improve LLM reasoning, high-quality long-chain reasoning data and optimized training pipelines remain underexplored for vision-language tasks. In this paper, we present Insight-V, an early effort to 1) scalably produce long and robust reasoning data for complex multi-modal tasks, and 2) build an effective training pipeline to enhance the reasoning capabilities of multi-modal large language models (MLLMs). Specifically, to create long and structured reasoning data without human labor, we design a two-step pipeline: a progressive strategy generates sufficiently long and diverse reasoning paths, and a multi-granularity assessment method ensures data quality. We observe that directly supervising MLLMs with such long and complex reasoning data does not yield ideal reasoning ability. To tackle this problem, we design a multi-agent system consisting of a reasoning agent dedicated to performing long-chain reasoning and a summary agent trained to judge and summarize the reasoning results. We further incorporate an iterative DPO algorithm to enhance the stability and quality of the reasoning agent's generations. Based on the popular LLaVA-NeXT model and our stronger base MLLM, we demonstrate significant performance gains on challenging multi-modal benchmarks that require visual reasoning. Benefiting from our multi-agent system, Insight-V can also easily maintain or improve performance on perception-focused multi-modal tasks.
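The division of labor in the multi-agent system described above can be sketched as follows. This is a minimal illustrative sketch, not the Insight-V implementation: the agent classes and their `generate` signatures are hypothetical stand-ins for the two trained MLLMs.

```python
# Hypothetical sketch of the two-agent inference flow: a reasoning agent
# produces a long chain-of-thought, and a summary agent judges that chain
# and distills the final answer. Class and method names are illustrative.

class ReasoningAgent:
    """Stands in for the MLLM trained to emit long structured reasoning."""
    def generate(self, image, question):
        # In Insight-V this would be a long, multi-step reasoning path.
        return f"Step 1: inspect {image}. Step 2: relate it to '{question}'."

class SummaryAgent:
    """Stands in for the MLLM trained to judge and summarize reasoning."""
    def generate(self, image, question, reasoning):
        # Trained to selectively trust the chain rather than copy it blindly.
        return f"Answer derived from reasoning: {reasoning}"

def multi_agent_answer(image, question):
    reasoner, summarizer = ReasoningAgent(), SummaryAgent()
    chain = reasoner.generate(image, question)          # long-chain reasoning
    return summarizer.generate(image, question, chain)  # judge and summarize

print(multi_agent_answer("chart.png", "What is the trend?"))
```

Decoupling the two roles is what lets the summary agent preserve perception-focused performance even when the reasoning chain is noisy, since it is trained to assess the chain rather than follow it unconditionally.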