Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks

We introduce GraphicDesignBench (GDB), the first comprehensive benchmark suite designed specifically to evaluate AI models on the full breadth of professional graphic design tasks. Unlike existing benchmarks that focus on natural-image understanding or generic text-to-image synthesis, GDB targets the unique challenges of professional design work: translating communicative intent into structured layouts, rendering typographically faithful text, manipulating layered compositions, producing valid vector graphics, and reasoning about animation. The suite comprises 50 tasks organized along five axes (layout, typography, infographics, template & design semantics, and animation), each evaluated under both understanding and generation settings and grounded in real-world design templates drawn from the LICA layered-composition dataset. We evaluate a set of frontier closed-source models using a standardized metric taxonomy covering spatial accuracy, perceptual quality, text fidelity, semantic alignment, and structural validity. Our results reveal that current models fall short on the core challenges of professional design: spatial reasoning over complex layouts, faithful vector code generation, fine-grained typographic perception, and temporal decomposition of animations all remain largely unsolved. While high-level semantic understanding is within reach, the gap widens sharply as tasks demand precision, structure, and compositional awareness. GDB provides a rigorous, reproducible testbed for tracking progress toward AI systems that can function as capable design collaborators. The full evaluation framework is publicly available.
