This paper presents a framework for generating 3D indoor scenes from text prompts. Existing methods often formulate scene synthesis as an object layout prediction problem conditioned on a single input modality, such as a text description, room shape, or scene graph. This formulation can lead to object collisions and limited functional plausibility, which reduces the practical applicability of the generated scenes. To address these limitations, we introduce a multi-stage pipeline that better reflects practical scene creation scenarios. Given a text prompt describing partial scene content, our method first uses graph diffusion to produce a contextually coherent scene graph and then predicts a realistic object layout from that graph. In addition, we incorporate lightweight human-object interaction priors to encourage human-centric and functional arrangements, together with explicit spatial constraints to reduce object interpenetration. Our approach generates coherent 3D scenes with viable layouts that better support human interaction. Experiments on the 3D-FRONT dataset demonstrate that our method achieves competitive or state-of-the-art performance compared with existing approaches, while improving the physical plausibility of generated scenes.
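To make the idea of an explicit spatial constraint concrete, the following is a minimal sketch of one possible interpenetration penalty. The abstract does not specify the form of the constraint, so everything here is an assumption made for illustration: objects are approximated by axis-aligned boxes given as (center, size) pairs, and the names `overlap_volume` and `interpenetration_penalty` are hypothetical rather than taken from the method.

```python
from itertools import combinations

def overlap_volume(a, b):
    """Overlap volume of two axis-aligned boxes.

    Each box is (center, size), both 3-tuples in the same units.
    This box representation is an assumption for illustration only.
    """
    volume = 1.0
    for ca, sa, cb, sb in zip(a[0], a[1], b[0], b[1]):
        # Length of the shared interval along this axis.
        extent = min(ca + sa / 2, cb + sb / 2) - max(ca - sa / 2, cb - sb / 2)
        if extent <= 0:
            return 0.0  # Separated (or merely touching) on this axis.
        volume *= extent
    return volume

def interpenetration_penalty(boxes):
    """Sum of pairwise overlap volumes; zero for a collision-free layout."""
    return sum(overlap_volume(a, b) for a, b in combinations(boxes, 2))

# Hypothetical layout: the first two boxes overlap, the third is apart.
layout = [
    ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0)),
    ((1.0, 0.0, 0.0), (2.0, 2.0, 2.0)),
    ((5.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
]
penalty = interpenetration_penalty(layout)
```

A penalty of this kind could be added to a layout objective or used as a post-hoc check. Oriented boxes or mesh-level tests would be needed for rotated or non-convex furniture.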