Multimodal large language models (MLLMs) have made remarkable progress in multimodal reasoning and code generation, catalyzing a new paradigm for front-end development. In particular, these models can directly transform visual designs into executable code, significantly improving the efficiency and adaptability of web development. Modern web applications are dynamic and interactive, with frequent interactions between users and pages. However, existing benchmarks largely evaluate code generation for static webpages, ignoring the complex interactive behaviors of real-world applications. Moreover, their evaluation criteria remain confined to visual fidelity and code structure, overlooking interaction consistency between the generated and reference webpages. To address these limitations, we introduce WebIGBench, the first benchmark designed to evaluate code generation for interactive webpages with complex interactions. By combining manually designed interaction paths with UI automation, we collected 103 complex webpages from real-world websites. The benchmark covers 5 popular interactive action types (e.g., click, input) and involves 871 distinct interactive actions. We also propose a novel evaluation pipeline to address the gap in automated assessment of interactive actions. Extensive experiments with several representative MLLMs on WebIGBench reveal the performance boundaries of current models in interactive webpage code generation. The benchmark is available at https://github.com/anoa12159-hue/WebIGBench_eval.