We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike prior code generation benchmarks that focus on algorithmic tasks, VISTA targets realistic UI-centric development, where agents must produce functional, visually coherent applications from underspecified inputs. We define five prompt-information conditions that vary along two axes, visual/structural fidelity and stack constraint: (1) text only with free stack choice, (2) text with reference screenshots under three specified stacks, (3) text with reference screenshots under free stack choice, (4) text with screenshots and pruned Figma structure under a single specified stack, and (5) text with screenshots and pruned Figma structure under free stack choice. To enable robust evaluation, each page in the benchmark is manually annotated with interactive UI components and around three visual anchor points, addressing the well-known limitations of script-based testing tools such as Playwright in open-ended code generation settings. Evaluation combines DOM-grounded reference matching, behavior-specific browser tests, and CLIP-based visual similarity, jointly measuring structural alignment, behavioral completeness, and overall visual fidelity. We use VISTA to assess four agent systems drawn from two model families and two harnesses, finding that visual fidelity and functional correctness are partially decoupled across both input conditions and agents, and that agent editing style varies sharply but is largely orthogonal to task quality. VISTA establishes a rigorous and reproducible foundation for advancing agent-based software engineering research.