Large language models (LLMs) have proven effective for layout generation due to their ability to produce structure-description languages, such as HTML or JSON, even without access to visual information. Recently, LLM providers have evolved these models into large vision-language models (LVLMs), which exhibit impressive multi-modal understanding capabilities. How, then, can we leverage this multi-modal power for layout generation? To answer this, we propose Visual-Aware Self-Correction LAyout GeneRation (VASCAR) for LVLM-based content-aware layout generation. In our method, LVLMs iteratively refine their outputs with reference to rendered layout images, which are visualized as colored bounding boxes on poster backgrounds. In experiments, we demonstrate that our method, combined with Gemini and without any additional training, achieves state-of-the-art (SOTA) layout generation quality, outperforming both existing layout-specific generative models and other LLM-based methods.
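To make the iterative visual self-correction loop concrete, the following is a minimal sketch based only on the description above. It assumes a hypothetical `query_lvlm` function standing in for an LVLM API call (e.g., to Gemini); the prompt wording, the JSON layout schema, and the element color mapping are all illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of VASCAR-style visual-aware self-correction.
# `query_lvlm` is a hypothetical stand-in for an LVLM API call (e.g., Gemini);
# the prompts and the layout JSON schema are assumptions for illustration.
import json
from PIL import Image, ImageDraw


def render_layout(background: Image.Image, layout: list[dict]) -> Image.Image:
    """Draw each element's bounding box as a colored rectangle on the poster."""
    canvas = background.copy()
    draw = ImageDraw.Draw(canvas)
    colors = {"title": "red", "logo": "blue", "text": "green"}  # assumed mapping
    for elem in layout:
        x, y, w, h = elem["x"], elem["y"], elem["w"], elem["h"]
        draw.rectangle(
            [x, y, x + w, y + h],
            outline=colors.get(elem["type"], "orange"),
            width=3,
        )
    return canvas


def vascar(background: Image.Image, constraints: str,
           query_lvlm, num_iters: int = 5) -> list[dict]:
    """Iteratively refine a layout: at each step, render the current candidate
    as colored boxes on the poster and show that image back to the LVLM."""
    prompt = f"Propose a poster layout as a JSON list of boxes for: {constraints}"
    layout = json.loads(query_lvlm(prompt, images=[background]))
    for _ in range(num_iters):
        rendered = render_layout(background, layout)
        feedback_prompt = (
            "Here is your layout rendered as colored bounding boxes on the "
            "poster. Fix overlaps, misalignment, and occluded content; "
            "return the revised layout as JSON."
        )
        layout = json.loads(query_lvlm(feedback_prompt, images=[rendered]))
    return layout
```

The key design point the abstract highlights is that feedback arrives visually, as a rendered image, rather than as text alone, which requires no additional training of the underlying model.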