In this work, we present SciGraphQA, a synthetic multi-turn question-answer dataset related to academic graphs. SciGraphQA is 13 times larger than ChartVQA, the previously largest chart-visual question-answering dataset. It is also the largest open-sourced chart VQA dataset with non-synthetic charts. To build our dataset, we selected 290,000 Computer Science or Machine Learning ArXiv papers published between 2010 and 2020, and then used Palm-2 to generate 295K samples of open-vocabulary multi-turn question-answering dialogues about the graphs. As context, we provided the text-only Palm-2 with the paper title, abstract, the paragraph mentioning the graph, and rich text contextual data from the graph itself, obtaining dialogues with an average of 2.23 question-answer turns per graph. We asked GPT-4 to assess the matching quality of our question-answer turns given the paper's context, obtaining an average rating of 8.7/10 on our 3K test set. We evaluated the 0-shot capability of popular multimodal large language models (MLLMs) such as LLaVA, mPLUGowl, BLIP-2, and openFlamingo on our dataset, finding LLaVA-13B to be the most performant with a CIDEr score of 0.08. We further enriched the question prompts for LLaVA by including the serialized data tables extracted from the graphs using the DePlot model, boosting LLaVA's 0-shot CIDEr to 0.15. To verify the validity of our dataset, we also fine-tuned LLaVA on it, reaching a substantially higher CIDEr score of 0.26. We anticipate further accuracy improvements from including segmentation mask tokens and from leveraging larger LLM backbones coupled with emergent prompting techniques. Our code and data are open-sourced.
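
The prompt-enrichment step, in which a serialized data table is added to the question given to the model, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `serialize_table` and `build_augmented_prompt`, the ` | ` cell separator, the prompt template, and the table values are all assumptions made for illustration, and the sketch does not call DePlot itself; it assumes the table has already been extracted.

```python
from typing import Sequence


def serialize_table(header: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Flatten a table into text, one line per row, cells joined by ' | '.

    The separator and layout are assumptions, not a documented DePlot format.
    """
    lines = [" | ".join(header)]
    for row in rows:
        if len(row) != len(header):
            raise ValueError("row length does not match header length")
        lines.append(" | ".join(str(cell) for cell in row))
    return "\n".join(lines)


def build_augmented_prompt(question: str, table_text: str) -> str:
    """Prepend the serialized table to the question (hypothetical template)."""
    return (
        "Data table extracted from the graph:\n"
        f"{table_text}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    # Made-up values, used only to show the shape of the resulting prompt.
    table = serialize_table(["epoch", "accuracy"], [[1, 0.61], [2, 0.74]])
    print(build_augmented_prompt("How does accuracy change over epochs?", table))
```

Placing the table before the question is one possible design choice: the model reads the numeric content of the graph as plain text before it sees what is being asked.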