Recently, the strong text generation ability of Large Language Models (LLMs) has given rise to many tools that assist with paper reading or even writing. However, the weak diagram analysis abilities of LLMs and Multimodal LLMs greatly limit their application scenarios, especially for academic paper writing. In this work, towards a more versatile copilot for academic paper writing, we focus on strengthening the multimodal diagram analysis ability of Multimodal LLMs. By parsing the LaTeX source files of high-quality papers, we carefully build a multimodal diagram understanding dataset, M-Paper. By aligning the diagrams in each paper with their related paragraphs, we construct professional diagram analysis samples for training and evaluation. M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables given as images or as LaTeX code. In addition, to better align the copilot with the user's intention, we introduce the "outline" as a control signal, which can be given directly by the user or revised from an auto-generated one. Comprehensive experiments with a state-of-the-art Multimodal LLM demonstrate that training on our dataset yields stronger scientific diagram understanding, covering diagram captioning, diagram analysis, and outline recommendation. The dataset, code, and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/PaperOwl.
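To make the data construction step more concrete, the following is a minimal sketch of how figures and tables in a LaTeX source could be paired with the paragraphs that reference them. It is an illustration under stated assumptions, not the pipeline used to build M-Paper: the function names `extract_diagrams` and `align_paragraphs` are hypothetical, diagrams are assumed to be labeled `figure` or `table` environments, and a paragraph is assumed to be "related" to a diagram when it cites the diagram's label with `\ref`, `\autoref`, `\cref`, or `\Cref`.

```python
import re

# Illustrative sketch only; the names and matching rules below are assumptions,
# not the authors' implementation.
ENV_RE = re.compile(r"\\begin\{(figure|table)\*?\}(.*?)\\end\{\1\*?\}", re.DOTALL)
LABEL_RE = re.compile(r"\\label\{([^}]*)\}")
REF_RE = re.compile(r"\\(?:ref|autoref|cref|Cref)\{([^}]*)\}")


def extract_diagrams(tex):
    """Map each label to the kind and LaTeX source of its figure/table environment."""
    diagrams = {}
    for match in ENV_RE.finditer(tex):
        kind, body = match.group(1), match.group(2)
        label = LABEL_RE.search(body)
        if label:  # unlabeled floats cannot be referenced, so they are skipped
            diagrams[label.group(1)] = {"kind": kind, "source": match.group(0)}
    return diagrams


def align_paragraphs(tex, diagrams):
    """Pair each paragraph with the labeled diagrams it references, in order of mention."""
    running_text = ENV_RE.sub("", tex)  # drop the floats so only prose remains
    samples = []
    for paragraph in re.split(r"\n\s*\n", running_text):
        paragraph = paragraph.strip()
        referenced = []
        for match in REF_RE.finditer(paragraph):
            # \cref-style commands may list several comma-separated labels
            for key in match.group(1).split(","):
                key = key.strip()
                if key in diagrams and key not in referenced:
                    referenced.append(key)
        if referenced:
            samples.append({"paragraph": paragraph, "diagrams": referenced})
    return samples
```

A paragraph that cites more than one label produces a sample with several diagrams, which corresponds to the joint multi-diagram setting described above. Real LaTeX sources would need further handling (macros, `\input` files, subfigures) that this sketch leaves out.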