Recent advances in large language models (LLMs) have enabled agentic systems that translate natural language intent into executable scientific visualization (SciVis) tasks. Despite rapid progress, the community lacks a principled and reproducible benchmark for evaluating these emerging SciVis agents in realistic, multi-step analysis settings. We present SciVisAgentBench, a comprehensive and extensible benchmark for evaluating scientific data analysis and visualization agents. Our benchmark is grounded in a structured taxonomy spanning four dimensions: application domain, data type, complexity level, and visualization operation. It currently comprises 108 expert-crafted cases covering diverse SciVis scenarios. To enable reliable assessment, we introduce a multimodal outcome-centric evaluation pipeline that combines LLM-based judging with deterministic evaluators, including image-based metrics, code checkers, rule-based verifiers, and case-specific evaluators. We also conduct a validity study with 12 SciVis experts to examine the agreement between human and LLM judges. Using this framework, we evaluate representative SciVis agents and general-purpose coding agents to establish initial baselines and reveal capability gaps. SciVisAgentBench is designed as a living benchmark to support systematic comparison, diagnose failure modes, and drive progress in agentic SciVis. The benchmark is available at https://scivisagentbench.github.io/.