Design-to-code translates high-fidelity UI designs into executable front-end implementations, but progress remains hard to compare because datasets, toolchains, and evaluation protocols are inconsistent. We introduce 1D-Bench, a benchmark grounded in real e-commerce workflows. The name 1D stands for "one day", reflecting the goal of completing a design-to-code task in under a day. Each instance provides a reference rendering and an exported intermediate representation that may contain extraction errors. Models take both as input, using the intermediate representation as a source of structural cues while being evaluated against the reference rendering; this tests robustness to defects in the intermediate representation rather than literal adherence to it. 1D-Bench requires models to generate an executable React codebase with an explicit component hierarchy under a fixed toolchain, and defines a multi-round setting in which models iteratively apply component-level edits using execution feedback. Experiments on commercial and open-weight multimodal models show that iterative editing generally improves final performance by increasing rendering success and, often, visual similarity. We further conduct a pilot study on post-training with synthetic repair trajectories and reinforcement-learning-based editing, and observe limited and unstable gains that may stem from sparse terminal rewards and high-variance file-level updates.
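The multi-round setting can be pictured as a simple loop: render the current codebase, hand the execution feedback to the model, apply the component-level edits it returns, and repeat. The sketch below is only an illustration under stated assumptions, not the benchmark's actual harness. The names `Feedback`, `Codebase`, `refine`, `execute`, and `propose_edit` are hypothetical, as are the round limit and the choice to stop when the model proposes no edits.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: component file path -> source text.
Codebase = Dict[str, str]


@dataclass
class Feedback:
    """Hypothetical execution feedback for one round."""
    rendered: bool       # did the build and render succeed?
    similarity: float    # visual similarity to the reference rendering
    log: str = ""        # build or runtime error text, if any


def refine(
    initial: Codebase,
    execute: Callable[[Codebase], Feedback],
    propose_edit: Callable[[Codebase, Feedback], Codebase],
    max_rounds: int = 3,
) -> Tuple[Codebase, List[Feedback]]:
    """Apply component-level edits for up to max_rounds rounds.

    execute stands in for the fixed toolchain (build, render, compare);
    propose_edit stands in for the model and returns only the files it
    wants to rewrite. Both are assumptions made for this sketch.
    """
    code = dict(initial)
    feedback = execute(code)
    history = [feedback]
    for _ in range(max_rounds):
        edits = propose_edit(code, feedback)
        if not edits:
            break  # the model proposes no further changes
        # Overwrite only the edited components; other files are untouched.
        code = {**code, **edits}
        feedback = execute(code)
        history.append(feedback)
    return code, history
```

Because only the edited files are overwritten, a failed render in one component can be repaired without regenerating the rest of the codebase, which is the kind of component-level update the multi-round setting describes.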