Charts are a powerful tool for visually conveying complex data, but comprehending them is challenging because chart types are diverse and their components are intricate. Existing chart comprehension methods rely either on heuristic rules or too heavily on OCR systems, which leads to suboptimal performance. To address these issues, we present ChartReader, a unified framework that integrates chart derendering and comprehension tasks. Our approach combines a transformer-based chart component detection module with an extended pre-trained vision-language model for chart-to-X tasks. Because it learns the rules of charts automatically from annotated datasets, our approach removes the need for hand-crafted rules, reducing manual effort and improving accuracy. We also introduce a data variable replacement technique and extend the input and position embeddings of the pre-trained model to enable cross-task training. We evaluate ChartReader on the Chart-to-Table, ChartQA, and Chart-to-Text tasks, where it outperforms existing methods. Our framework can substantially reduce the manual effort involved in chart analysis and takes a step toward a universal chart understanding model. Moreover, our approach allows plug-and-play integration with mainstream pre-trained language models such as T5 and TaPas, extending their capabilities to chart comprehension tasks. The code is available at https://github.com/zhiqic/ChartReader.