Attaining a high degree of user controllability in visual generation often requires intricate, fine-grained inputs such as layouts. However, such inputs impose a substantial burden on users compared with simple text inputs. To address this issue, we study how Large Language Models (LLMs) can serve as visual planners by generating layouts from text conditions, and thereby collaborate with visual generative models. We propose LayoutGPT, a method that composes in-context visual demonstrations in a style sheet language to enhance the visual planning skills of LLMs. LayoutGPT can generate plausible layouts in multiple domains, ranging from 2D images to 3D indoor scenes. LayoutGPT also shows superior performance in converting challenging language concepts, such as numerical and spatial relations, into layout arrangements for faithful text-to-image generation. When combined with a downstream image generation model, LayoutGPT outperforms text-to-image models/systems by 20-40% and performs comparably to human users in designing visual layouts for numerical and spatial correctness. Lastly, LayoutGPT performs comparably to supervised methods in 3D indoor scene synthesis, demonstrating its effectiveness and potential in multiple visual domains.
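
To make the idea of in-context demonstrations written in a style sheet language more concrete, the following is a minimal sketch of how 2D layouts might be serialized as CSS-like rules, assembled into a few-shot prompt, and parsed back from a model's completion. The abstract does not specify the format, so the `Box` type, the property names (`left`, `top`, `width`, `height`), the pixel units, the `/* Prompt: ... */` comment convention, and all function names here are illustrative assumptions rather than the method's actual implementation. No LLM call is included.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    """A labeled 2D bounding box (hypothetical layout element)."""
    label: str
    left: int
    top: int
    width: int
    height: int


def box_to_css(box: Box) -> str:
    # Serialize one box as a single CSS-like rule (assumed format).
    return (f"{box.label} {{ left: {box.left}px; top: {box.top}px; "
            f"width: {box.width}px; height: {box.height}px }}")


def demo_to_text(caption: str, boxes: List[Box]) -> str:
    # One demonstration: the text condition as a comment, then one rule per box.
    lines = [f"/* Prompt: {caption} */"]
    lines += [box_to_css(b) for b in boxes]
    return "\n".join(lines)


def build_prompt(demos: List[Tuple[str, List[Box]]], query: str) -> str:
    # Concatenate demonstrations, then leave the query's layout for the LLM to fill in.
    parts = [demo_to_text(caption, boxes) for caption, boxes in demos]
    parts.append(f"/* Prompt: {query} */")
    return "\n\n".join(parts) + "\n"


_COMMENT = re.compile(r"/\*.*?\*/", flags=re.S)
_RULE = re.compile(r"(?P<label>[\w\- ]+?)\s*\{(?P<body>[^}]*)\}")


def parse_css(text: str) -> List[Box]:
    # Parse CSS-like rules (e.g. a model completion) back into Box objects.
    boxes = []
    for m in _RULE.finditer(_COMMENT.sub("", text)):
        props = {}
        for decl in m.group("body").split(";"):
            if ":" in decl:
                key, value = decl.split(":", 1)
                props[key.strip()] = int(re.sub(r"px$", "", value.strip()))
        boxes.append(Box(m.group("label").strip(), props["left"],
                         props["top"], props["width"], props["height"]))
    return boxes
```

Under these assumptions, `build_prompt([("two apples on a table", [Box("apple", 10, 20, 30, 40), Box("apple", 50, 20, 30, 40)])], "three cats on a sofa")` would yield a prompt that ends with the query comment, and the model's completion could then be passed to `parse_css` to recover boxes for a downstream layout-conditioned image generator.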