Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in automated front-end engineering, e.g., generating UI code from visual designs. However, existing front-end UI code generation benchmarks have the following limitations: (1) While framework-based development has become predominant in modern front-end programming, current benchmarks fail to incorporate mainstream development frameworks. (2) Existing evaluations focus solely on the UI code generation task, whereas practical UI development involves several iterations, including refinement edits and issue repairs. (3) Current benchmarks employ unidimensional evaluation, lacking investigation into influencing factors such as task difficulty, input context variations, and in-depth code-level analysis. To bridge these gaps, we introduce DesignBench, a multi-framework, multi-task evaluation benchmark for assessing MLLMs' capabilities in automated front-end engineering. DesignBench encompasses three widely used UI frameworks (React, Vue, and Angular) alongside vanilla HTML/CSS, and evaluates three essential front-end tasks (generation, editing, and repair) drawn from real-world development workflows. DesignBench contains 900 webpage samples spanning 11 topics, 9 edit types, and 6 issue categories, enabling detailed analysis of MLLM performance across multiple dimensions. Our systematic evaluation reveals critical insights into MLLMs' framework-specific limitations, task-related bottlenecks, and performance variations under different conditions, providing guidance for future research in automated front-end development. Our code and data are available at https://github.com/WebPAI/DesignBench.