Pretrained multimodal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but their reasoning capabilities remain limited in domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human-annotated reasoning datasets. We introduce COGS (COmposition-Grounded data Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.
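The decompose-then-recompose idea can be illustrated with a minimal sketch. This is a hypothetical toy implementation, not the authors' actual pipeline: factor names, templates, and the `recompose` helper are all illustrative. It shows the core combinatorics, where crossing a few perception factors and reasoning factors with new images yields many synthetic questions, each carrying its perception subquestion for factor-level process rewards.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative sketch of COGS-style recomposition; all names and
# templates here are assumptions, not the paper's implementation.

@dataclass(frozen=True)
class PerceptionFactor:
    name: str        # e.g. "read_value"
    template: str    # phrase describing what to perceive in the image

@dataclass(frozen=True)
class ReasoningFactor:
    name: str        # e.g. "compare"
    template: str    # question template composed over a perceived entity

def recompose(perception, reasoning, images):
    """Cross perception/reasoning factors with new images to yield
    synthetic questions plus the subquestions used as process rewards."""
    samples = []
    for img, p, r in product(images, perception, reasoning):
        samples.append({
            "image": img,
            "question": r.template.format(entity=p.template),
            "subquestions": [f"What is {p.template}?"],
        })
    return samples

perception = [PerceptionFactor("read_value", "the value of the tallest bar")]
reasoning = [ReasoningFactor("compare", "Is {entity} greater than 50?")]
pairs = recompose(perception, reasoning, ["chart_001.png", "chart_002.png"])
print(len(pairs))  # one synthetic question per (image, perception, reasoning) triple
```

With only a handful of seed-derived factors, the number of generated question-answer pairs grows multiplicatively with the image pool, which is what makes the approach data-efficient.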