Chart Question Answering (CQA) evaluates Multimodal Large Language Models (MLLMs) on visual understanding and reasoning over chart data. However, existing benchmarks mostly test surface-level parsing, such as reading labels and legends, while overlooking deeper scientific reasoning. We propose DomainCQA, a framework for constructing domain-specific CQA benchmarks that emphasize both visual comprehension and knowledge-intensive reasoning. It integrates complexity-aware chart selection, multi-tier QA generation, and expert validation. Applied to astronomy, DomainCQA yields AstroChart, a benchmark of 1,690 QA pairs over 482 charts that exposes persistent weaknesses in fine-grained perception, numerical reasoning, and domain-knowledge integration across 21 MLLMs. Fine-tuning on AstroChart improves performance on both fundamental and advanced tasks. Pilot QA sets in biochemistry, economics, medicine, and social science further demonstrate DomainCQA's generality. Together, our results establish DomainCQA as a unified pipeline for constructing and augmenting domain-specific chart-reasoning benchmarks.