Comprehending natural language instructions is a desirable capability for both 2D and 3D layout synthesis systems. Existing methods implicitly model object joint distributions and express object relations, which hinders the controllability of generation. We introduce InstructLayout, a novel generative framework that integrates a semantic graph prior and a layout decoder to improve the controllability and fidelity of 2D and 3D layout synthesis. The proposed semantic graph prior jointly learns layout appearances and object distributions, and demonstrates versatility across various downstream tasks in a zero-shot manner. To facilitate benchmarking of text-driven 2D and 3D scene synthesis, we curate two high-quality datasets of layout-instruction pairs from public Internet resources using large language and multimodal models. Extensive experimental results show that the proposed method outperforms existing state-of-the-art approaches by a large margin on both 2D and 3D layout synthesis tasks, and thorough ablation studies confirm the efficacy of the crucial design components.