Chart parsing is challenging because charts vary widely in style, values, text, and other elements. Even advanced large vision-language models (LVLMs) with billions of parameters struggle to handle such tasks satisfactorily. To address this, we propose OneChart, a reliable agent devised specifically for the structural extraction of chart information. Like popular LVLMs, OneChart has an autoregressive main body. Unlike them, to make the numerical parts of the output more reliable, we introduce an auxiliary token placed at the beginning of the token sequence, together with an additional decoder. Because this auxiliary token is optimized for numerical values, the subsequent chart-parsing tokens can capture enhanced numerical features through causal attention. Furthermore, with the aid of the auxiliary token, we devise a self-evaluation mechanism that lets the model gauge the reliability of its chart parsing results by providing confidence scores for the generated content. Compared with current state-of-the-art (SOTA) chart parsing models such as DePlot, ChartVLM, and ChartAst, OneChart achieves significantly higher Average Precision (AP) for chart structural extraction across multiple public benchmarks, despite having only 0.2 billion parameters. Moreover, as a chart parsing agent, it also brings accuracy gains of more than 10% to the popular LVLM LLaVA-1.6 on the downstream ChartQA benchmark.
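The abstract does not say how the confidence scores are computed. As a minimal sketch of one plausible scheme, assume the additional decoder yields one numeric value per data point from the auxiliary token, and that the same values can be parsed from the generated text. The two lists are min-max normalized and compared by mean absolute difference. The function names, the normalization, and the scoring formula below are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence


def min_max_normalize(values: Sequence[float]) -> list[float]:
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def confidence_score(text_values: Sequence[float],
                     aux_values: Sequence[float]) -> float:
    """Hypothetical self-evaluation score in [0, 1].

    text_values: numbers parsed from the generated chart description.
    aux_values:  numbers predicted by the additional decoder (assumed).
    Returns 1.0 when the two normalized lists agree exactly, and 0.0
    when they cannot be compared (empty or different lengths).
    """
    if not text_values or len(text_values) != len(aux_values):
        return 0.0
    a = min_max_normalize(text_values)
    b = min_max_normalize(aux_values)
    # Both lists lie in [0, 1], so the mean absolute difference does too.
    mean_abs_diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 1.0 - mean_abs_diff


if __name__ == "__main__":
    # Same shape at a different scale: the two outputs agree.
    print(confidence_score([10, 20, 30], [1, 2, 3]))
    # Reversed trend: the two outputs disagree.
    print(confidence_score([10, 20, 30], [3, 2, 1]))
```

Under these assumptions, a downstream system could accept a parse only when the score exceeds a chosen threshold and fall back to another strategy otherwise.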