Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs

User-interface-to-code (UI2Code) aims to generate executable code that faithfully reconstructs a given input UI. Prior work focuses largely on web pages and mobile screens, leaving app widgets underexplored. Unlike web or mobile UIs, which carry rich hierarchical context, widgets are compact, context-free micro-interfaces that summarize key information through dense layouts and iconography under strict spatial constraints. Moreover, while (image, code) pairs are widely available for web and mobile UIs, widget designs are proprietary and lack accessible markup. We formalize this setting as the Widget-to-Code (Widget2Code) task and introduce an image-only widget benchmark with fine-grained, multi-dimensional evaluation metrics. Benchmarking shows that although general-purpose multimodal large language models (MLLMs) outperform specialized UI2Code methods, they still produce unreliable and visually inconsistent code. To address these limitations, we develop a baseline that jointly advances perceptual understanding and structured code generation. At the perceptual level, we follow widget design principles to assemble atomic components into complete layouts, supported by icon retrieval and reusable visualization modules. At the system level, we design an end-to-end infrastructure, WidgetFactory, which includes a framework-agnostic, widget-tailored domain-specific language (WidgetDSL) and a compiler that translates it into multiple front-end implementations (e.g., React, HTML/CSS). An adaptive rendering module further refines spatial dimensions to satisfy compactness constraints. Together, these contributions substantially improve visual fidelity, establishing a strong baseline and a unified infrastructure for future Widget2Code research.
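To make the idea of a framework-agnostic DSL compiled to several front ends concrete, here is a minimal sketch in Python. It is not the actual WidgetDSL or its compiler: the `Node` type, its fields, and the functions `to_html` and `to_jsx` are hypothetical names invented for illustration, and the three node kinds are an assumed toy vocabulary. The sketch only shows how a single widget description might be lowered to both an HTML/CSS string and a React-style JSX string.

```python
import json
from dataclasses import dataclass, field
from html import escape


@dataclass
class Node:
    """Hypothetical DSL node; not the real WidgetDSL schema."""
    kind: str                      # "stack", "text", or "icon" (assumed vocabulary)
    value: str = ""                # text content or icon name
    direction: str = "column"      # flex direction, used by "stack" only
    children: list["Node"] = field(default_factory=list)


def to_html(node: Node) -> str:
    """Lower a node tree to an HTML string with inline CSS."""
    if node.kind == "text":
        return "<span>" + escape(node.value) + "</span>"
    if node.kind == "icon":
        return '<i class="icon" data-name="' + escape(node.value) + '"></i>'
    if node.kind == "stack":
        inner = "".join(to_html(child) for child in node.children)
        style = "display:flex;flex-direction:" + node.direction
        return '<div style="' + style + '">' + inner + "</div>"
    raise ValueError(f"unknown node kind: {node.kind}")


def to_jsx(node: Node) -> str:
    """Lower the same tree to a React-style JSX string."""
    if node.kind == "text":
        return "<span>{" + json.dumps(node.value) + "}</span>"
    if node.kind == "icon":
        # <Icon> is an assumed component name, not a real library export.
        return "<Icon name={" + json.dumps(node.value) + "} />"
    if node.kind == "stack":
        inner = "".join(to_jsx(child) for child in node.children)
        style = '{{display: "flex", flexDirection: ' + json.dumps(node.direction) + "}}"
        return "<div style=" + style + ">" + inner + "</div>"
    raise ValueError(f"unknown node kind: {node.kind}")


# One widget description, two front-end targets.
widget = Node("stack", direction="row",
              children=[Node("icon", "sun"), Node("text", "72 F")])
print(to_html(widget))
print(to_jsx(widget))
```

The design point the sketch illustrates is that the model only has to produce the intermediate tree; each target framework is handled by a separate, deterministic lowering function, so adding a new front end would not require changing the generation step.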
