Graphic layout generation, a growing research field, plays a significant role in user engagement and information perception. Existing methods primarily treat layout generation as a numerical optimization task, focusing on quantitative aspects while overlooking the semantic information of a layout, such as the relationships between layout elements. In this paper, we propose LayoutNUWA, the first model that treats layout generation as a code generation task to enhance semantic information and harness the hidden layout expertise of large language models (LLMs). More concretely, we develop a Code Instruct Tuning (CIT) approach comprising three interconnected modules: 1) the Code Initialization (CI) module quantifies the numerical conditions and initializes them as HTML code with strategically placed masks; 2) the Code Completion (CC) module employs the formatting knowledge of LLMs to fill in the masked portions of the HTML code; 3) the Code Rendering (CR) module transforms the completed code into the final layout output, yielding a highly interpretable and transparent generation procedure that maps code directly to a visualized layout. We achieve state-of-the-art performance on multiple datasets, with improvements exceeding 50% in some cases, showcasing the strong capabilities of LayoutNUWA. Our code is available at https://github.com/ProjectNUWA/LayoutNUWA.
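The three CIT modules can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the LayoutNUWA repository: the `<rect>`-based template, the `<M>` mask token, and the names `Element`, `code_initialization`, `code_completion_stub`, and `code_rendering` are hypothetical. In particular, the completion step here is a stub that fills masks from a supplied list, standing in for the fine-tuned LLM.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

MASK = "<M>"  # hypothetical mask token; the real token may differ


@dataclass
class Element:
    category: str
    x: Optional[int] = None
    y: Optional[int] = None
    width: Optional[int] = None
    height: Optional[int] = None


def code_initialization(elements: List[Element], canvas_w: int, canvas_h: int) -> str:
    """CI: write the known conditions as markup and mask every unknown attribute."""
    def val(v: Optional[int]) -> str:
        return MASK if v is None else str(v)

    rows = [
        f'<rect data-category="{e.category}" x="{val(e.x)}" y="{val(e.y)}" '
        f'width="{val(e.width)}" height="{val(e.height)}"/>'
        for e in elements
    ]
    body = "\n".join(rows)
    return f'<svg width="{canvas_w}" height="{canvas_h}">\n{body}\n</svg>'


def code_completion_stub(html: str, fill_values: List[int]) -> str:
    """CC stand-in: replace masks left to right with given values (no LLM involved)."""
    values = iter(fill_values)
    return re.sub(re.escape(MASK), lambda _: str(next(values)), html)


RECT = re.compile(
    r'<rect data-category="([^"]+)" x="(\d+)" y="(\d+)" '
    r'width="(\d+)" height="(\d+)"/>'
)


def code_rendering(html: str) -> List[Element]:
    """CR: parse the completed markup back into concrete bounding boxes."""
    return [
        Element(c, int(x), int(y), int(w), int(h))
        for c, x, y, w, h in RECT.findall(html)
    ]


# Usage: one partially specified element and one fully unspecified element.
elements = [Element("title", x=10, y=5, width=100), Element("image")]
masked = code_initialization(elements, canvas_w=120, canvas_h=160)
completed = code_completion_stub(masked, [20, 10, 30, 100, 80])
layout = code_rendering(completed)
```

Because the intermediate representation is ordinary markup, each stage can be inspected as text: the masked string shows exactly which attributes the model is asked to predict, and the completed string maps one-to-one onto the rendered boxes.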