Charts play a vital role in data visualization, in understanding data patterns, and in informed decision-making. However, their combination of graphical elements (e.g., bars, lines) and textual components (e.g., labels, legends) poses challenges for general-purpose multimodal models. Vision-language models trained on chart data perform well on comprehension but struggle to generalize. To address these challenges, we propose ChartAssistant, a chart-based vision-language model for universal chart comprehension and reasoning. ChartAssistant leverages ChartSFT, a comprehensive dataset covering diverse chart-related tasks with both basic (e.g., bar and pie) and specialized (e.g., radar and bubble) chart types. The model is trained in two stages: it is first pre-trained on chart-to-table parsing to align chart and text, and then fine-tuned with multitask instruction following. This approach enables ChartAssistant to achieve competitive performance across various chart tasks. Experimental results demonstrate significant performance gains over the state-of-the-art UniChart and ChartLlama methods, particularly on real-world chart data in the zero-shot setting. The code and data are available at https://github.com/OpenGVLab/ChartAst.
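To make the chart-to-table pre-training stage more concrete, the following is a minimal sketch of how an image-text training pair for that stage might be assembled. It is an illustration only: the function names (`linearize_table`, `build_pretraining_example`), the separators, the prompt wording, and the file name are all assumptions and are not taken from ChartAssistant or ChartSFT.

```python
from typing import Dict, Sequence


def linearize_table(
    header: Sequence[str],
    rows: Sequence[Sequence[object]],
    col_sep: str = " | ",  # assumed column separator, not from the paper
    row_sep: str = " & ",  # assumed row separator, not from the paper
) -> str:
    """Flatten a table into a single string that can serve as a text target."""
    lines = [col_sep.join(str(h) for h in header)]
    lines.extend(col_sep.join(str(cell) for cell in row) for row in rows)
    return row_sep.join(lines)


def build_pretraining_example(
    image_path: str,
    header: Sequence[str],
    rows: Sequence[Sequence[object]],
) -> Dict[str, str]:
    """Pair a chart image with its linearized table (hypothetical record layout)."""
    return {
        "image": image_path,
        "prompt": "Convert the chart to a table.",  # hypothetical instruction
        "target": linearize_table(header, rows),
    }


example = build_pretraining_example(
    "bar_chart_001.png",  # hypothetical file name
    ["Year", "Sales"],
    [[2020, 10], [2021, 15]],
)
print(example["target"])
```

In a setup like this, the second stage would reuse the same record layout with task-specific prompts (for example, question answering or summarization) in place of the fixed table-parsing prompt.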