Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which are predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks and provide only phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method for constructing a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit chain-of-thought (CoT) reasoning. Using only open models, we create a dataset of 12M instruction-response pairs covering diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves their reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). The model also achieves notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components of the dataset construction process, such as rewriting and self-filtering.
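As a rough illustration of the rewriting and self-filtering steps mentioned above, the sketch below shows one way such a pipeline could be organized: an open MLLM expands each phrase-level QA pair into a response with an explicit rationale, then judges whether the rewritten answer still agrees with the original label. The `mllm.generate` interface, the prompts, and all helper names are assumptions made for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a rewrite-then-self-filter loop, assuming a hypothetical
# open MLLM exposing a generate(image=..., prompt=...) -> str interface.
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str
    question: str
    answer: str  # phrase-level label from the source academic dataset

def rewrite(sample: Sample, mllm) -> str:
    """Expand a phrase-level QA pair into a response with a CoT rationale."""
    prompt = (
        f"Question: {sample.question}\n"
        f"Reference answer: {sample.answer}\n"
        "Rewrite this as a detailed response that reasons step by step "
        "before stating the final answer."
    )
    return mllm.generate(image=sample.image_path, prompt=prompt)

def self_filter(sample: Sample, response: str, mllm) -> bool:
    """Keep a rewritten response only if the model judges its final answer
    consistent with the original ground-truth label."""
    verdict = mllm.generate(
        image=sample.image_path,
        prompt=(
            f"Response: {response}\n"
            f"Ground truth: {sample.answer}\n"
            "Does the response's final answer match the ground truth? "
            "Answer yes or no."
        ),
    )
    return verdict.strip().lower().startswith("yes")

def build_dataset(source_samples: list[Sample], mllm) -> list[dict]:
    """Rewrite every source sample and retain only self-consistent outputs."""
    dataset = []
    for sample in source_samples:
        response = rewrite(sample, mllm)
        if self_filter(sample, response, mllm):
            dataset.append({"instruction": sample.question, "response": response})
    return dataset
```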