Despite rapid progress, multimodal reasoning still lacks a systematic approach for synthesizing large-scale vision-centric datasets beyond visual math. We introduce a framework that synthesizes vision-centric problems spanning diverse levels of complexity, along with a resulting dataset of over 1M high-quality problems comprising reasoning traces, preference data, and instruction prompts that support SFT, offline RL, and online RL. Our vision-centric synthesis framework uses a two-stage process: (1) generating diverse, verifiable questions from existing images at scale, and (2) creating complex compositional visual problems by merging simpler questions. Remarkably, Qwen2.5-VL-7B finetuned on our data outperforms existing open-data baselines across all evaluated vision-centric benchmarks, and our best configurations match or surpass strong closed-data models such as MiMo-VL-7B-RL on Vstar Bench, CV-Bench, and MMStar-V. Notably, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro, +3.7%) and audio reasoning (MMAU, +1.32%). Similarly, despite containing no embodied visual data, it yields notable gains on open-ended embodied QA (NiEH, +8.8%). Lastly, we use our data to comprehensively analyze the entire VLM post-training pipeline at scale (1M+ samples), showing that (i) SFT on high-quality reasoning traces exhibiting cognitive behaviors is essential to scale online RL, (ii) offline RL can match online RL's performance while disaggregating compute demands, and (iii) SFT on high-quality data also improves out-of-domain, cross-modality transfer.
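To make stage (2) of the synthesis concrete, the sketch below shows one plausible reading of how two simpler verifiable questions about the same image could be merged into a single compositional problem whose answer remains programmatically checkable. This is a minimal illustration, not the paper's actual pipeline; all names (`VerifiableQuestion`, `compose`) and the concatenation-based merge rule are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class VerifiableQuestion:
    """A question about an image with a programmatically checkable answer."""
    image_id: str
    question: str
    answer: str  # ground-truth answer used by the verifier

def compose(q1: VerifiableQuestion, q2: VerifiableQuestion) -> VerifiableQuestion:
    """Merge two simple questions about the same image into one
    compositional problem. The merged answer concatenates the two
    ground truths, so correctness is still verifiable by exact match."""
    assert q1.image_id == q2.image_id, "compose questions about a single image"
    return VerifiableQuestion(
        image_id=q1.image_id,
        question=(
            f"Answer both parts about the image. "
            f"(a) {q1.question} (b) {q2.question}"
        ),
        answer=f"(a) {q1.answer} (b) {q2.answer}",
    )

# Example: two atomic perception questions become one compositional problem.
qa = VerifiableQuestion("img_042", "How many red cars are visible?", "3")
qb = VerifiableQuestion("img_042", "What object is left of the bus?", "a bicycle")
print(compose(qa, qb).question)
```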