Web development involves turning UI designs into functional webpages, which can be difficult for both beginners and experienced developers due to the complexity of HTML's hierarchical structures and styles. While Large Language Models (LLMs) have shown promise in generating source code, two major challenges persist in UI-to-HTML code generation: (1) effectively representing HTML's hierarchical structure for LLMs, and (2) bridging the gap between the visual nature of UI designs and the text-based format of HTML code. To tackle these challenges, we introduce Waffle, a new fine-tuning strategy that uses a structure-aware attention mechanism to improve LLMs' understanding of HTML's structure and a contrastive fine-tuning approach to align LLMs' understanding of UI images and HTML code. Models fine-tuned with Waffle achieve up to 9.00 percentage points (pp) higher HTML match, 0.0982 higher CW-SSIM, 32.99 higher CLIP, and 27.12 pp higher LLEM on our new benchmark WebSight-Test and the existing benchmark Design2Code, outperforming current fine-tuning methods.