Automated reverse engineering of HTML/CSS code from UI screenshots is an important yet challenging problem with broad applications in website development and design. In this paper, we propose a novel vision-code transformer (ViCT) composed of a vision encoder that processes the screenshots and a language decoder that generates the code. Both are initialized from pre-trained models such as ViT/DiT and GPT-2/LLaMA, but aligning the two modalities requires end-to-end fine-tuning that aims to minimize the visual discrepancy between the code-rendered webpage and the original screenshot. However, rendering is non-differentiable and incurs costly overhead. We address this problem with actor-critic fine-tuning, in which a visual critic without rendering (ViCR) is developed to predict the visual discrepancy given the original and generated code. To train and evaluate our models, we created two synthetic datasets of varying complexity, with over 75,000 unique (code, screenshot) pairs. We evaluate UI-to-Code performance using a combination of automated metrics such as MSE, BLEU, IoU, and a novel htmlBLEU score. ViCT outperforms a strong baseline model, DiT-GPT2, improving IoU from 0.64 to 0.79 and lowering MSE from 12.25 to 9.02. At much lower computational cost, it achieves performance comparable to that obtained with a larger decoder such as LLaMA.
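To make the two visual metrics named above concrete, the following is a minimal sketch of how MSE and IoU between an original screenshot and a code-rendered one might be computed. It is an illustration under stated assumptions, not the paper's evaluation code: the function names `pixel_mse` and `foreground_iou` are hypothetical, images are assumed to be grayscale NumPy arrays of equal shape, and IoU is assumed to be taken over non-background (non-white) pixel masks, which may differ from the definition the authors use.

```python
import numpy as np


def pixel_mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two equally shaped images."""
    a = a.astype(np.float64)  # avoid uint8 wrap-around on subtraction
    b = b.astype(np.float64)
    return float(np.mean((a - b) ** 2))


def foreground_iou(a: np.ndarray, b: np.ndarray, background: int = 255) -> float:
    """IoU of the non-background pixel masks of two grayscale images.

    Assumption: every pixel that differs from `background` counts as
    foreground. This is an illustrative choice, not the paper's definition.
    """
    mask_a = a != background
    mask_b = b != background
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 1.0  # both images are blank, treat as perfect overlap
    intersection = np.logical_and(mask_a, mask_b).sum()
    return float(intersection / union)


# Toy 4x4 "screenshots": a 2x2 dark block, shifted one column in the prediction.
target = np.full((4, 4), 255, dtype=np.uint8)
target[0:2, 0:2] = 0
pred = np.full((4, 4), 255, dtype=np.uint8)
pred[0:2, 1:3] = 0

print(pixel_mse(target, pred))
print(foreground_iou(target, pred))
```

In this toy case the two blocks share two of six foreground pixels, so the overlap is one third. Both metrics require a rendered image of the generated code, which is the costly, non-differentiable step that the critic described above is meant to avoid during fine-tuning.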