Text-driven 3D indoor scene generation could be useful for gaming, the film industry, and AR/VR applications. However, existing methods cannot faithfully capture the room layout, nor do they allow flexible editing of individual objects in the room. To address these problems, we present Ctrl-Room, which generates convincing 3D rooms with designer-style layouts and high-fidelity textures from just a text prompt. Moreover, Ctrl-Room enables versatile interactive editing operations such as resizing or moving individual furniture items. Our key insight is to separate the modeling of layouts and appearance. To this end, our proposed method consists of two stages, a `Layout Generation Stage' and an `Appearance Generation Stage'. The `Layout Generation Stage' trains a text-conditional diffusion model to learn the layout distribution with our holistic scene code parameterization. Next, the `Appearance Generation Stage' employs a fine-tuned ControlNet to produce a vivid panoramic image of the room guided by the 3D scene layout and text prompt. In this way, we achieve a high-quality 3D room with convincing layouts and lively textures. Benefiting from the scene code parameterization, we can easily edit the generated room model through our mask-guided editing module, without expensive editing-specific training. Extensive experiments on the Structured3D dataset demonstrate that our method outperforms existing methods in producing more reasonable, view-consistent, and editable 3D rooms from natural language prompts.
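
To make the idea of an editable scene code concrete, the following is a minimal sketch and not the Ctrl-Room implementation. The abstract does not specify the parameterization, so the `ObjectBox` fields (category, center, size, yaw) and the `move` and `resize` helpers are hypothetical names chosen for illustration. It only shows why an explicit per-object layout makes edits such as moving or resizing a furniture item cheap: they reduce to changing a few numbers before the appearance is regenerated.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class ObjectBox:
    """Hypothetical per-object entry of a scene code (assumed fields)."""
    category: str   # e.g. "bed", "nightstand"
    center: Vec3    # box center in metres
    size: Vec3      # box extents in metres
    yaw: float      # rotation about the vertical axis, in radians


def move(box: ObjectBox, offset: Vec3) -> ObjectBox:
    """Return a copy of the box translated by the given offset."""
    new_center = tuple(c + d for c, d in zip(box.center, offset))
    return replace(box, center=new_center)


def resize(box: ObjectBox, scale: float) -> ObjectBox:
    """Return a copy of the box uniformly scaled about its center."""
    new_size = tuple(s * scale for s in box.size)
    return replace(box, size=new_size)


# A toy layout with two furniture items (values are made up).
scene: List[ObjectBox] = [
    ObjectBox("bed", (1.0, 2.0, 0.25), (2.0, 1.6, 0.5), 0.0),
    ObjectBox("nightstand", (2.5, 2.0, 0.25), (0.5, 0.5, 0.5), 0.0),
]

# Edit only the bed; every other object is left untouched.
edited = [move(scene[0], (0.5, 0.0, 0.0))] + scene[1:]
enlarged_bed = resize(scene[0], 2.0)
```

In a two-stage pipeline like the one described above, an edited layout of this kind would then be passed back to the appearance stage, which is outside the scope of this sketch.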