Recently, the strong text generation ability of Large Language Models (LLMs) has given rise to many tools that assist with paper reading or even writing. However, the weak diagram analysis abilities of LLMs and Multimodal LLMs greatly limit their application scenarios, especially for academic paper writing. In this work, towards a more versatile copilot for academic paper writing, we focus on strengthening the multimodal diagram analysis ability of Multimodal LLMs. By parsing the LaTeX source files of high-quality papers, we carefully build a multimodal diagram understanding dataset, M-Paper. By aligning the diagrams in each paper with their related paragraphs, we construct professional diagram analysis samples for training and evaluation. M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables in the form of images or LaTeX code. In addition, to better align the copilot with the user's intention, we introduce the "outline" as a control signal, which can be given directly by the user or revised from auto-generated ones. Comprehensive experiments with a state-of-the-art Multimodal LLM demonstrate that training on our dataset yields stronger scientific diagram understanding performance on diagram captioning, diagram analysis, and outline recommendation. The dataset, code, and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/PaperOwl.
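The idea of parsing LaTeX sources and aligning diagrams with related paragraphs can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`extract_diagrams`, `align_paragraphs`), the regular expressions, and the heuristic of matching paragraphs to diagrams through `\label`/`\ref` pairs are all assumptions made for illustration, and real LaTeX sources would need a far more robust parser.

```python
import re

# Hypothetical patterns: figure/table environments (including starred
# variants), their labels, and common reference commands.
ENV_RE = re.compile(r"\\begin\{(figure|table)\*?\}(.*?)\\end\{\1\*?\}", re.DOTALL)
LABEL_RE = re.compile(r"\\label\{([^}]*)\}")
REF_RE = re.compile(r"\\(?:ref|autoref|cref|Cref)\{([^}]*)\}")


def extract_diagrams(tex):
    """Map each label to the kind and LaTeX source of its figure or table."""
    diagrams = {}
    for match in ENV_RE.finditer(tex):
        kind, body = match.group(1), match.group(2)
        for label in LABEL_RE.findall(body):
            diagrams[label] = {"kind": kind, "source": body.strip()}
    return diagrams


def align_paragraphs(tex, diagrams):
    """Pair each body paragraph with the labeled diagrams it references."""
    # Drop the environments first so captions are not treated as paragraphs.
    text = ENV_RE.sub("", tex)
    samples = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        refs = []
        for group in REF_RE.findall(paragraph):
            # A single command may cite several labels, e.g. \cref{a,b}.
            for label in group.split(","):
                label = label.strip()
                if label in diagrams and label not in refs:
                    refs.append(label)
        if refs:
            samples.append({"paragraph": paragraph, "diagrams": refs})
    return samples
```

Under this assumed heuristic, a paragraph that references both a figure and a table produces a single sample listing both labels, which loosely mirrors the joint comprehension of multiple diagrams described in the abstract.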