Recent vision-language models (VLMs) have shown promising progress in generating webpages from visual inputs, yet existing evaluations mainly focus on short, single-screen, and largely static webpages. We introduce LongWebBench, a benchmark for evaluating long-horizon webpage generation from both structural and functional perspectives. LongWebBench contains 490 real-world long webpages for structural fidelity evaluation and 507 goal-oriented interaction tasks over 129 webpages for functional evaluation. It employs two complementary protocols: a multi-dimensional VLM-based metric for assessing long-range structural coherence, and a DOM-augmented agent-based pipeline for end-to-end functional verification. We further examine the automatic evaluation protocols through human agreement analysis. Experiments with state-of-the-art open-source and proprietary VLMs under single-image and multi-image settings reveal that structural fidelity degrades as webpage length increases, while visually plausible generations often fail to support executable multi-step interactions. These results highlight the need to evaluate long webpage generation beyond visual similarity, with executable interaction as a core criterion. Our code and data are available at https://github.com/zheny2751-dotcom/LongWebBench.