Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, self-contained QA pairs with reasoning chains over focused document segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains drawn from 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark for evaluating multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly on tasks that require complex document-level reasoning.