Sketches are a natural and accessible medium for UI designers to conceptualize early-stage ideas. However, existing research on UI/UX automation often requires high-fidelity inputs like Figma designs or detailed screenshots, limiting accessibility and impeding efficient design iteration. To bridge this gap, we introduce Sketch2Code, a benchmark that evaluates state-of-the-art Vision Language Models (VLMs) on automating the conversion of rudimentary sketches into webpage prototypes. Beyond end-to-end benchmarking, Sketch2Code supports interactive agent evaluation that mimics real-world design workflows, where a VLM-based agent iteratively refines its generations by communicating with a simulated user, either passively receiving feedback instructions or proactively asking clarification questions. We comprehensively analyze ten commercial and open-source models, showing that Sketch2Code is challenging for existing VLMs; even the most capable models struggle to accurately interpret sketches and formulate effective questions that lead to steady improvement. Nevertheless, a user study with UI/UX experts reveals a significant preference for proactive question-asking over passive feedback reception, highlighting the need to develop more effective paradigms for multi-turn conversational agents.