Charts play a vital role in data visualization, understanding data patterns, and informed decision-making. However, their unique combination of graphical elements (e.g., bars, lines) and textual components (e.g., labels, legends) poses challenges for general-purpose multimodal models. While vision-language models trained on chart data excel at comprehension, they struggle with generalization and require task-specific fine-tuning. To address these challenges, we propose ChartAssistant, a chart-based vision-language model for universal chart comprehension and reasoning. ChartAssistant leverages ChartSFT, a comprehensive dataset covering diverse chart-related tasks with basic and specialized chart types. The model is trained in two stages: pre-training on chart-to-table parsing to align chart and text, followed by multitask instruction-following fine-tuning. This approach enables ChartAssistant to achieve competitive performance across various chart tasks without task-specific fine-tuning. Experimental results demonstrate significant performance gains over the state-of-the-art UniChart method and show that ChartAssistant outperforms OpenAI's GPT-4V(ision) on real-world chart data. The code and data are available at https://github.com/OpenGVLab/ChartAst.