Screenshot-to-code generation aims to translate user-interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from screenshots but are trained without ever observing the visual outcome of the code they generate. In contrast, human developers iteratively render their implementation, compare it with the design, and learn how visual differences relate to code changes. Inspired by this process, we propose VisRefiner, a training framework that enables models to learn from the visual differences between rendered predictions and reference designs. We construct difference-aligned supervision that associates visual discrepancies with the corresponding code edits, allowing the model to understand how appearance variations arise from implementation changes. Building on this, we introduce a reinforcement learning stage for self-refinement, in which the model improves its generated code by observing both the rendered output and the target design, identifying their visual differences, and updating the code accordingly. Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity, while also endowing models with a strong self-refinement capability. These results demonstrate the effectiveness of learning from visual differences for advancing screenshot-to-code generation.
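The render-compare-edit loop described above can be sketched as follows. This is a toy illustration only, not the paper's implementation: "rendering" is modeled as evaluating a list of UI elements, the "visual diff" is a per-element mismatch list, and the refinement policy is a trivial fix-first-difference rule. All function names here (`render`, `visual_diff`, `refine`) are illustrative assumptions.

```python
def render(code):
    """Stand-in renderer: the 'screenshot' is just the element list itself."""
    return list(code)

def visual_diff(rendered, target):
    """Return indices where the rendered output disagrees with the design."""
    return [i for i, (a, b) in enumerate(zip(rendered, target)) if a != b]

def refine(code, target, max_steps=10):
    """Iteratively repair elements flagged by the visual diff, mimicking the
    self-refinement behavior a model would acquire from difference-aligned
    supervision and the RL stage."""
    code = list(code)
    for _ in range(max_steps):
        diffs = visual_diff(render(code), target)
        if not diffs:          # rendered output now matches the design
            break
        code[diffs[0]] = target[diffs[0]]  # apply one localized code edit
    return code

# Example: one element ('span') differs from the design and gets corrected.
print(refine(["div", "span", "p"], ["div", "a", "p"]))  # → ['div', 'a', 'p']
```

In the actual framework the diff is computed between two images and the edit is produced by the model, but the control flow of the refinement stage has this same shape: observe, compare, edit, repeat until the difference vanishes or a step budget is exhausted.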