Multimodal Large Language Models demonstrate strong performance on natural image understanding, yet exhibit limited capability in interpreting scientific images, including but not limited to schematic diagrams, experimental characterizations, and analytical charts. This limitation is particularly pronounced in open-source MLLMs. The gap largely stems from existing datasets' limited domain coverage, coarse structural annotations, and weak semantic grounding. We introduce OmniScience, a large-scale, high-fidelity multimodal dataset comprising 1.5 million figure-caption-context triplets spanning more than 10 major scientific disciplines. To obtain image-caption data with higher information density and accuracy for multimodal large-model training, we develop a dynamic model-routing re-captioning pipeline that leverages state-of-the-art multimodal large language models to generate dense, self-contained descriptions by jointly synthesizing visual features, original figure captions, and the corresponding in-text references authored by human scientists. The pipeline is further reinforced with rigorous quality filtering and alignment with human expert judgments, ensuring both factual accuracy and semantic completeness, and boosts the image-text multimodal similarity score from 0.769 to 0.956. We further propose a caption QA protocol as a proxy task for evaluating visual understanding. Under this setting, a Qwen2.5-VL-3B model finetuned on OmniScience shows substantial gains over baselines, achieving gains of 0.378 on MM-MT-Bench and 0.140 on MMMU.
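The re-captioning loop described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the model names, the complexity-based routing rule, the 0.8 similarity threshold, and the `generate`/`similarity` callables are all hypothetical stand-ins for an MLLM captioning call and a CLIP-style image-text scorer.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical sketch of a dynamic model-routing re-captioning pipeline.
# All names and thresholds are illustrative assumptions, not values from
# the paper (which reports only the similarity rise from 0.769 to 0.956).

@dataclass
class Figure:
    image_id: str
    original_caption: str
    in_text_references: list   # sentences in the paper that cite this figure
    complexity: float          # assumed per-figure difficulty estimate in [0, 1]

def route_model(fig: Figure) -> str:
    """Route harder figures to a stronger (slower) captioning model."""
    return "strong-captioner" if fig.complexity > 0.5 else "fast-captioner"

def recaption(fig: Figure,
              generate: Callable[[str, str, str], str],
              similarity: Callable[[str, str], float],
              threshold: float = 0.8) -> Optional[str]:
    """Generate a dense, self-contained caption from the image, the original
    caption, and in-text references; keep it only if it passes the filter."""
    model = route_model(fig)
    # Synthesize the textual context the captioner conditions on.
    prompt = "\n".join([fig.original_caption, *fig.in_text_references])
    dense_caption = generate(model, fig.image_id, prompt)
    # Quality filter: drop captions below an image-text similarity threshold.
    if similarity(fig.image_id, dense_caption) >= threshold:
        return dense_caption
    return None
```

In a real pipeline, `generate` would call the routed MLLM and `similarity` would be an embedding-based scorer; rejected captions could be re-queued to a different model.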