Large vision-language models (LVLMs) excel across diverse tasks involving concrete images from natural scenes. However, their ability to interpret abstract figures, such as geometric shapes and scientific plots, remains limited due to a scarcity of training datasets in scientific domains. To fill this gap, we introduce Multimodal ArXiv, consisting of ArXivCap and ArXivQA, for enhancing LVLMs' scientific comprehension. ArXivCap is a figure-caption dataset comprising 6.4M images and 3.9M captions, sourced from 572K ArXiv papers spanning various scientific domains. Drawing from ArXivCap, we introduce ArXivQA, a question-answering dataset generated by prompting GPT-4V with scientific figures. Training on ArXivQA greatly enhances the mathematical reasoning capabilities of open-source LVLMs, achieving a 10.4% absolute accuracy gain on a multimodal mathematical reasoning benchmark. Furthermore, using ArXivCap, we devise four vision-to-text tasks for benchmarking LVLMs. Evaluation results with state-of-the-art LVLMs show that they struggle with the nuanced semantics of academic figures, while domain-specific training yields substantial performance gains. Our error analysis uncovers misinterpretations of visual context, recognition errors, and overly simplified captions produced by current LVLMs, shedding light on future improvements.