Large language models (LLMs) have proven effective for layout generation due to their ability to produce structure-description languages, such as HTML or JSON, even without access to visual information. Recently, LLM providers have evolved these models into large vision-language models (LVLMs), which show impressive multi-modal understanding capabilities. How, then, can we leverage this multi-modal power for layout generation? To answer this, we propose Visual-Aware Self-Correction LAyout GeneRation (VASCAR) for LVLM-based content-aware layout generation. In our method, LVLMs iteratively refine their outputs with reference to rendered layout images, which are visualized as colored bounding boxes on poster backgrounds. In experiments, we demonstrate the effectiveness of our method combined with Gemini. Without any additional training, VASCAR achieves state-of-the-art (SOTA) layout generation quality, outperforming both existing layout-specific generative models and other LLM-based methods.
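To make the iterative visual-aware self-correction concrete, below is a minimal Python sketch of the loop described above. The `query_lvlm` function, the JSON layout schema, and the color map are illustrative assumptions for this sketch, not VASCAR's actual prompts or API.

```python
# Minimal sketch of a visual-aware self-correction loop, assuming an LVLM
# client that accepts (prompt, image) and returns a JSON layout string.
import json
from PIL import Image, ImageDraw

# Illustrative color map: one color per element type (assumed schema).
COLORS = {"title": "red", "text": "blue", "logo": "green", "underlay": "orange"}

def query_lvlm(prompt: str, image: Image.Image) -> str:
    """Hypothetical LVLM call; replace with an actual client (e.g. a Gemini API)."""
    raise NotImplementedError("plug in your LVLM client here")

def render_layout(background: Image.Image, layout: list[dict]) -> Image.Image:
    """Draw each layout element as a colored bounding box on the poster background."""
    canvas = background.copy()
    draw = ImageDraw.Draw(canvas)
    for elem in layout:
        x, y, w, h = elem["x"], elem["y"], elem["w"], elem["h"]
        draw.rectangle([x, y, x + w, y + h],
                       outline=COLORS.get(elem["type"], "gray"), width=3)
    return canvas

def self_correction_loop(background: Image.Image, prompt: str,
                         n_iters: int = 5) -> list[dict]:
    """Iteratively ask the LVLM to refine its layout, showing it the rendering."""
    layout = json.loads(query_lvlm(prompt, background))  # initial proposal
    for _ in range(n_iters):
        rendered = render_layout(background, layout)
        feedback_prompt = (prompt + "\nHere is your previous layout rendered as "
                           "colored boxes on the poster. Fix overlaps, misalignment, "
                           "and occlusion of salient content, then return revised JSON.")
        layout = json.loads(query_lvlm(feedback_prompt, rendered))
    return layout
```

The key design point is that the model critiques a rendered image of its own output rather than the raw coordinates, which is what lets the LVLM's multi-modal understanding drive the correction without any additional training.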