Scientific diagrams convey explicit structural information, yet modern text-to-image models often produce visually plausible but structurally incorrect results. Existing benchmarks either rely on image-centric or subjective metrics insensitive to structure, or evaluate intermediate symbolic representations rather than final rendered images, leaving pixel-based diagram generation underexplored. We introduce SciFlow-Bench, a structure-first benchmark for evaluating scientific diagram generation directly from pixel-level outputs. Built from real scientific PDFs, SciFlow-Bench pairs each source framework figure with a canonical ground-truth graph and evaluates models as black-box image generators under a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison. This design enforces evaluation by structural recoverability rather than visual similarity alone, and is enabled by a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning. Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology, underscoring the need for structure-aware evaluation.
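The comparison step of the round-trip protocol can be illustrated with a minimal sketch. It assumes the inverse parser has already turned a generated image into a set of labeled nodes and directed edges, and it scores that recovered graph against the ground-truth graph with set-based precision, recall, and F1. The names `DiagramGraph`, `make_graph`, `prf`, and `structural_score` are hypothetical, and exact matching on normalized labels is an assumption made for illustration; the abstract does not specify the metrics SciFlow-Bench actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagramGraph:
    """Hypothetical container for a diagram's structure."""
    nodes: frozenset  # normalized node labels
    edges: frozenset  # (source_label, target_label) pairs


def normalize(label: str) -> str:
    # Lowercase and collapse whitespace so trivial text differences do not count as errors.
    return " ".join(label.lower().split())


def make_graph(nodes, edges) -> DiagramGraph:
    return DiagramGraph(
        nodes=frozenset(normalize(n) for n in nodes),
        edges=frozenset((normalize(s), normalize(t)) for s, t in edges),
    )


def prf(predicted: frozenset, reference: frozenset):
    # Set-overlap precision, recall, and F1 (assumed scoring rule, not the paper's).
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def structural_score(recovered: DiagramGraph, ground_truth: DiagramGraph) -> dict:
    return {
        "node": prf(recovered.nodes, ground_truth.nodes),
        "edge": prf(recovered.edges, ground_truth.edges),
    }


# Toy example: all three nodes are recovered, but one edge is wired incorrectly.
gt = make_graph(
    ["Encoder", "Fusion", "Decoder"],
    [("Encoder", "Fusion"), ("Fusion", "Decoder")],
)
rec = make_graph(
    ["encoder", "Fusion", "Decoder"],
    [("Encoder", "Fusion"), ("Encoder", "Decoder")],
)
scores = structural_score(rec, gt)
print(scores)
```

In this toy case the node scores are perfect while the edge scores drop to 0.5, which mirrors the abstract's point that an image can contain the right components and still be structurally wrong.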