Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm: the model generates code that represents a layout, and a graphics engine renders that code into the final image. Because these methods never see the rendered result, it is difficult for them to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose the Visual Feedback Layout Model (VFLM), a self-improving framework that uses visual feedback for iterative refinement. VFLM performs adaptive reflective generation: it uses visual information to reflect on issues in its previous outputs and iteratively regenerates until satisfactory quality is achieved. This capability is acquired through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. By rewarding only the final generated outcome, we effectively stimulate the model's iterative and reflective generative capabilities. Experiments across multiple benchmarks show that VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines, establishing visual feedback as critical for design-oriented MLLMs. Our code and data are available at https://github.com/FolSpark/VFLM.
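The generate, render, reflect loop described in the abstract can be illustrated with a minimal sketch. This is not the VFLM implementation: the function `refine_layout`, its callable parameters (`generate`, `render`, `is_satisfactory`), and the `max_rounds` cap are all hypothetical names introduced here for illustration, and in the actual model the decision to stop is presumably made by the MLLM itself rather than by a separate checker.

```python
from typing import Callable, List, Tuple

# One past attempt: the layout code and the image it rendered to.
Attempt = Tuple[str, bytes]


def refine_layout(
    prompt: str,
    generate: Callable[[str, List[Attempt]], str],   # hypothetical: model call
    render: Callable[[str], bytes],                  # hypothetical: graphics engine
    is_satisfactory: Callable[[str, bytes], bool],   # hypothetical: quality check
    max_rounds: int = 4,                             # assumed cap, not from the paper
) -> Tuple[str, int]:
    """Regenerate layout code until the rendered image is judged acceptable.

    Returns the last generated code and the number of rounds used.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")

    history: List[Attempt] = []
    code = ""
    rounds_used = 0
    for rounds_used in range(1, max_rounds + 1):
        # The generator sees the prompt plus every earlier (code, image) pair,
        # so it can reflect on what went wrong visually.
        code = generate(prompt, history)
        image = render(code)
        if is_satisfactory(prompt, image):
            break
        history.append((code, image))
    return code, rounds_used
```

A code-only baseline corresponds to `max_rounds=1`: the code is generated once and the rendered image is never fed back to the generator.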