Using vision-language models (VLMs) in web development is a promising way to increase efficiency and unlock no-code solutions: given a screenshot or a sketch of a UI, a VLM could generate the code that reproduces it, for instance in HTML. Despite advances in VLMs on various tasks, the specific challenge of converting a screenshot into the corresponding HTML code has received little attention. We posit that this is mainly due to the absence of a suitable, high-quality dataset. This work introduces WebSight, a synthetic dataset consisting of 2 million pairs of HTML code and corresponding screenshots. We fine-tune a foundational VLM on our dataset and show that it becomes proficient at converting webpage screenshots into functional HTML code. To accelerate research in this area, we open-source WebSight.