We present SemLayoutDiff, a unified model for synthesizing diverse 3D indoor scenes across multiple room types. The model introduces a scene layout representation that combines a top-down semantic map with per-object attributes. Unlike prior approaches, which cannot condition on architectural constraints, SemLayoutDiff employs a categorical diffusion model that explicitly conditions scene synthesis on room masks. It first generates a coherent semantic map, then applies a cross-attention-based network to predict furniture placements that respect the synthesized layout. Our method also accounts for architectural elements such as doors and windows, ensuring that generated furniture arrangements remain practical and unobstructed. Experiments on the 3D-FRONT dataset show that SemLayoutDiff produces spatially coherent, realistic, and varied scenes, outperforming previous methods.
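The two-stage pipeline described above can be sketched in minimal form. This is an illustrative toy, not the paper's implementation: the learned categorical diffusion denoiser and the cross-attention attribute network are replaced by placeholders, and the label set, grid size, and function names are assumptions for the sketch.

```python
import random

# Hypothetical label set (illustrative only; not the paper's taxonomy).
NUM_CLASSES = 5  # 0 = empty, 1..4 = furniture/architectural classes

def sample_semantic_map(room_mask, steps=8, seed=0):
    """Toy stand-in for stage 1 (categorical diffusion): start from uniform
    label noise and resample each cell per step. A trained denoiser would
    supply per-cell class probabilities; here they are uniform placeholders.
    Conditioning on the room mask zeroes out everything outside the room."""
    rng = random.Random(seed)
    h, w = len(room_mask), len(room_mask[0])
    grid = [[rng.randrange(NUM_CLASSES) for _ in range(w)] for _ in range(h)]
    for _ in range(steps):
        grid = [[rng.randrange(NUM_CLASSES) for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if not room_mask[i][j]:
                grid[i][j] = 0  # architectural constraint: nothing outside walls
    return grid

def predict_placements(sem_map):
    """Toy stand-in for stage 2: one centroid per furniture class present in
    the map. The paper instead predicts object attributes (position, size,
    orientation) with a cross-attention network conditioned on the layout."""
    cells = {}
    for i, row in enumerate(sem_map):
        for j, c in enumerate(row):
            if c > 0:
                cells.setdefault(c, []).append((i, j))
    return {c: (sum(p[0] for p in ps) / len(ps),
                sum(p[1] for p in ps) / len(ps))
            for c, ps in cells.items()}

# Usage: a hypothetical 12x12 room with a rectangular interior mask.
mask = [[2 <= i < 10 and 2 <= j < 10 for j in range(12)] for i in range(12)]
sem = sample_semantic_map(mask)
placements = predict_placements(sem)
```

The point of the sketch is the interface, not the model: stage 2 consumes only the stage-1 semantic map, so the mask conditioning applied in stage 1 automatically keeps every predicted placement inside the room.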