MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

Recent advances in Vision Language Models (VLMs) have driven significant progress in visual reasoning. However, open-source VLMs still lag behind proprietary systems, largely because high-quality reasoning data is scarce. Existing datasets offer limited coverage of challenging domains such as STEM diagrams and visual puzzles, and they lack the consistent, long-form Chain-of-Thought (CoT) annotations essential for eliciting strong reasoning capabilities. To bridge this gap, we introduce MMFineReason, a large-scale multimodal reasoning dataset comprising 1.8M samples and 5.1B solution tokens, with high-quality reasoning annotations distilled from Qwen3-VL-235B-A22B-Thinking. The dataset is built via a systematic three-stage pipeline: (1) large-scale data collection and standardization, (2) CoT rationale generation, and (3) comprehensive selection based on reasoning quality and difficulty awareness. The resulting dataset spans STEM problems, visual puzzles, games, and complex diagrams, and each sample is annotated with visually grounded reasoning traces. We fine-tune Qwen3-VL-Instruct on MMFineReason to produce MMFineReason-2B/4B/8B. Our models establish new state-of-the-art results for their size class. Notably, MMFineReason-4B surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B outperforms Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking, demonstrating strong parameter efficiency. Crucially, we uncover a "less is more" phenomenon via our difficulty-aware filtering strategy: a subset of just 7% of the data (123K samples) achieves performance comparable to the full dataset. We also find a synergistic effect in which reasoning-oriented data composition simultaneously boosts general capabilities.
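
As a quick sanity check on the scale figures quoted above, the following minimal sketch derives the average solution length per sample and the exact fraction that the 123K-sample subset represents. It uses only the rounded numbers stated in the abstract (1.8M samples, 5.1B solution tokens, 123K samples), so the results are approximate; the constant and function names are illustrative and do not come from the paper.

```python
# Rounded figures as quoted in the abstract; derived values are approximate.
TOTAL_SAMPLES = 1_800_000              # "1.8M samples"
TOTAL_SOLUTION_TOKENS = 5_100_000_000  # "5.1B solution tokens"
SUBSET_SAMPLES = 123_000               # "123K samples"


def avg_tokens_per_sample(tokens: int, samples: int) -> float:
    """Average number of solution tokens per sample."""
    return tokens / samples


def subset_fraction(subset: int, total: int) -> float:
    """Fraction of the full dataset covered by the filtered subset."""
    return subset / total


if __name__ == "__main__":
    avg = avg_tokens_per_sample(TOTAL_SOLUTION_TOKENS, TOTAL_SAMPLES)
    frac = subset_fraction(SUBSET_SAMPLES, TOTAL_SAMPLES)
    print(f"{avg:.0f} solution tokens per sample on average")  # 2833
    print(f"{frac:.1%} of the full dataset")                   # 6.8%
```

Under these rounded inputs, each sample carries roughly 2.8K solution tokens, consistent with the long-form CoT annotations described, and the 123K-sample subset is about 6.8% of the data, which matches the quoted 7%.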
