Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, i.e., those requiring precise visual interpretation rather than reliance on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to a 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent (a) is effective across diverse chart types, (b) achieves the highest accuracy across varying levels of visual and reasoning complexity, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.
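The iterative decompose-and-act loop described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the tool names (`crop_region`, `localize_axes`), the state dictionary, and the subtask schedule are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of a ChartAgent-style loop: a query is decomposed
# into visual subtasks, each fulfilled by a chart-specific vision tool
# that acts on a shared state (standing in for the chart image).
# All names below are illustrative assumptions, not the paper's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    run: Callable[[dict], dict]  # takes and returns the agent state

def crop_region(state: dict) -> dict:
    # e.g., isolate a single bar or pie slice before measuring it
    state["focus"] = f"crop({state['subtask']})"
    return state

def localize_axes(state: dict) -> dict:
    # establish a mapping from pixel coordinates to data coordinates
    state["axes"] = "pixel-to-data mapping"
    return state

TOOLS = {t.name: t for t in (Tool("crop_region", crop_region),
                             Tool("localize_axes", localize_axes))}

def chart_agent(query: str, subtasks: list[tuple[str, str]]) -> dict:
    """Iteratively resolve visual subtasks with chart-specific tools."""
    state: dict = {"query": query}
    for subtask, tool_name in subtasks:      # decomposition of the query
        state["subtask"] = subtask
        state = TOOLS[tool_name].run(state)  # act on the chart
    return state

result = chart_agent("What is the tallest bar's value?",
                     [("find axes", "localize_axes"),
                      ("isolate tallest bar", "crop_region")])
```

In a real system the planner (an LLM) would choose the next subtask and tool dynamically from the tool outputs, rather than following a fixed schedule as in this sketch.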