Comprehending natural language instructions is a desirable property for 3D indoor scene synthesis systems. Existing methods directly model object joint distributions and express object relations only implicitly within a scene, thereby hindering the controllability of generation. We introduce InstructScene, a novel generative framework that integrates a semantic graph prior and a layout decoder to improve the controllability and fidelity of 3D scene synthesis. The proposed semantic graph prior jointly learns scene appearances and layout distributions, exhibiting versatility across various downstream tasks in a zero-shot manner. To facilitate benchmarking for text-driven 3D scene synthesis, we curate a high-quality dataset of scene-instruction pairs with large language and multimodal models. Extensive experimental results show that the proposed method surpasses existing state-of-the-art approaches by a large margin. Thorough ablation studies confirm the efficacy of crucial design components. Project page: https://chenguolin.github.io/projects/InstructScene.