Web applications (web apps) have become a key arena for large language models (LLMs) to demonstrate their code generation capabilities and commercial potential. However, building a benchmark for LLM-generated web apps remains challenging: it requires real-world user requirements, evaluation metrics that generalize without relying on ground-truth implementations or test cases, and interpretable evaluation results. To address these challenges, we introduce WebCoderBench, the first real-world-collected, generalizable, and interpretable benchmark for web app generation. WebCoderBench comprises 1,572 real user requirements covering diverse modalities and expression styles that reflect realistic user intentions. WebCoderBench provides 24 fine-grained evaluation metrics across 9 perspectives, combining rule-based and LLM-as-a-judge paradigms for fully automated, objective, and general evaluation. Moreover, WebCoderBench adopts human-preference-aligned weights over these metrics to yield interpretable overall scores. Experiments on 12 representative LLMs and 2 LLM-based agents show that no single model dominates across all evaluation metrics, giving LLM developers an opportunity to optimize their models in a targeted manner toward a more powerful version.
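For illustration, the sketch below shows one way an overall score could be aggregated from per-metric scores using human-preference-aligned weights. The metric names and weight values are hypothetical placeholders, not the weights or metrics actually used by WebCoderBench.

```python
# Hypothetical sketch: aggregating normalized per-metric scores (each in [0, 1])
# into an overall score via a weighted average. The metrics and weights below
# are illustrative only and are not taken from WebCoderBench.

def overall_score(metric_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-metric scores using preference-aligned weights."""
    total_weight = sum(weights[m] for m in metric_scores)
    return sum(weights[m] * s for m, s in metric_scores.items()) / total_weight

# Example with made-up metrics and weights.
weights = {"functionality": 0.4, "ui_aesthetics": 0.25, "code_quality": 0.2, "accessibility": 0.15}
scores = {"functionality": 0.82, "ui_aesthetics": 0.67, "code_quality": 0.74, "accessibility": 0.55}
print(f"Overall score: {overall_score(scores, weights):.3f}")
```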