Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

Sukmin Yun,Haokun Lin,Rusiru Thushara,Mohammad Qazim Bhat,Yongxin Wang,Zutao Jiang,Mingkai Deng,Jinhong Wang,Tianhua Tao,Junbo Li,Haonan Li,Preslav Nakov,Timothy Baldwin,Zhengzhong Liu,Eric P. Xing,Xiaodan Liang,Zhiqiang Shen

from arxiv, Website at https://mbzuai-llm.github.io/webpage2code/

Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage's HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs' abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain, while previous datasets result in worse performance. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code will be available at https://github.com/MBZUAI-LLM/web2code.

翻译：多模态大语言模型（MLLMs）在图像、视频和音频等多种模态的理解与生成任务中已展现出令人瞩目的成功。然而，当前MLLMs在理解网页截图并生成对应HTML代码方面的能力却出人意料地薄弱。为解决此问题，我们提出了Web2Code——一个包含用于指令微调的大规模新型网页转代码数据集，以及用于评估MLLMs网页理解与HTML代码转换能力的评估框架。在数据集构建方面，我们利用预训练大语言模型增强现有网页转代码数据集，同时生成多样化的新网页渲染图像集合。具体而言，输入为网页图像及指令，输出则为对应网页的HTML代码。我们进一步在输出中包含了关于网页内容的多样化自然语言问答对，以实现对网页内容更全面的理解。为评估模型在此类任务中的性能，我们开发了一套评估框架，用于测试MLLMs在网页理解和网页转代码生成方面的能力。大量实验表明，我们提出的数据集不仅有利于所提出的特定任务，在通用视觉领域亦能带来增益，而现有数据集则会导致性能下降。我们希望本工作能为开发适用于基于网页的内容生成与任务自动化的通用MLLMs作出贡献。我们的数据与代码将在https://github.com/MBZUAI-LLM/web2code 公开。