Multi-modal large language models have demonstrated impressive performance on most vision-language tasks. However, these models generally lack the ability to understand domain-specific data, particularly chart figures. This is mainly due to the lack of relevant multi-modal instruction-tuning datasets. In this article, we create a high-quality instruction-tuning dataset by leveraging GPT-4. We develop a multi-step data generation process in which separate steps are responsible for generating tabular data, creating chart figures, and designing instruction-tuning data. The flexibility of this method enables us to generate diverse, high-quality instruction-tuning data consistently and efficiently while keeping resource expenditure low. It also allows us to incorporate a wider variety of chart and task types not yet featured in existing datasets. Next, we introduce ChartLlama, a multi-modal large language model trained on our dataset. ChartLlama outperforms all prior methods on the ChartQA, Chart-to-text, and Chart-extraction evaluation benchmarks. ChartLlama also significantly improves upon the baseline on our specially compiled chart dataset, which includes new chart and task types. These results confirm the value and considerable potential of our proposed data generation method for enhancing chart comprehension.
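To make the three-stage data generation process concrete, the following is a minimal sketch in Python and not the authors' implementation. All names (`ChartSample`, `generate_table`, `make_plot_code`, `build_instructions`, `build_sample`) are hypothetical. The abstract states that GPT-4 is leveraged for data generation; here the table is produced by a local random stand-in so the sketch runs offline, and the plotting script is assumed to target matplotlib but is only emitted as text, not executed.

```python
import json
import random
from dataclasses import dataclass


@dataclass
class ChartSample:
    table: dict          # stage 1 output: column name -> list of values
    plot_code: str       # stage 2 output: script text that would render the chart
    instructions: list   # stage 3 output: (question, answer) pairs


def generate_table(topic, rng):
    """Stage 1 (stand-in): create tabular data.

    Assumption: a real pipeline would prompt a language model here;
    this stub draws random values so the example is self-contained.
    """
    years = [2019, 2020, 2021, 2022]
    values = [rng.randint(10, 100) for _ in years]
    return {"year": years, topic: values}


def make_plot_code(table, chart_type="bar"):
    """Stage 2: turn the table into plotting code (assumed matplotlib).

    The script is returned as a string and is not executed here.
    """
    x_key, y_key = list(table)
    return (
        "import matplotlib.pyplot as plt\n"
        f"x = {table[x_key]!r}\n"
        f"y = {table[y_key]!r}\n"
        f"plt.{chart_type}(x, y)\n"
        f"plt.xlabel({x_key!r}); plt.ylabel({y_key!r})\n"
        "plt.savefig('chart.png')\n"
    )


def build_instructions(table):
    """Stage 3: derive instruction-tuning pairs whose answers come from the table."""
    x_key, y_key = list(table)
    best = max(range(len(table[y_key])), key=lambda i: table[y_key][i])
    return [
        # Question-answering task grounded in the underlying data.
        (f"Which {x_key} has the highest {y_key}?", str(table[x_key][best])),
        # Chart-extraction task: recover the table behind the figure.
        ("Extract the underlying data table from the chart.", json.dumps(table)),
    ]


def build_sample(topic, seed=0):
    """Run the three stages in order and bundle their outputs."""
    rng = random.Random(seed)
    table = generate_table(topic, rng)
    return ChartSample(table, make_plot_code(table), build_instructions(table))


if __name__ == "__main__":
    sample = build_sample("sales")
    print(sample.plot_code)
    for question, answer in sample.instructions:
        print(question, "->", answer)
```

Because every answer in this sketch is computed from the table rather than read off a rendered image, the instruction data stays consistent with the chart by construction. Keeping the stages separate also means a new chart type only needs a change in the plotting stage, which is one plausible reading of the flexibility the abstract describes.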