We present SemLayoutDiff, a unified model for synthesizing diverse 3D indoor scenes across multiple room types. The model introduces a scene layout representation that combines a top-down semantic map with per-object attributes. Unlike prior approaches, which cannot condition on architectural constraints, SemLayoutDiff employs a categorical diffusion model capable of conditioning scene synthesis explicitly on room masks. It first generates a coherent semantic map and then applies a cross-attention-based network to predict furniture placements that respect the synthesized layout. Our method also accounts for architectural elements such as doors and windows, ensuring that generated furniture arrangements remain practical and unobstructed. Experiments on the 3D-FRONT dataset show that SemLayoutDiff produces spatially coherent, realistic, and varied scenes, outperforming previous methods.
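The two-stage pipeline in the abstract can be sketched in miniature: a categorical sampling step over a top-down semantic map that is explicitly conditioned on a binary room mask (stage 1), followed by a simple attribute predictor that derives placements from the map (stage 2). Everything below is an illustrative assumption, not the paper's implementation: `NUM_CLASSES`, `diffusion_step`, and `predict_attributes` are hypothetical names, the logits stand in for a trained denoising network, and the centroid-based placement is a stand-in for the cross-attention attribute network.

```python
import numpy as np

# Hypothetical sketch of the two-stage pipeline described in the abstract.
# Stage 1: categorical diffusion over a top-down semantic map, conditioned
# on a binary room mask. Stage 2: per-object attribute prediction.
# Names, shapes, and class ids are illustrative assumptions only.

NUM_CLASSES = 4  # 0 = empty floor, 1..3 = furniture categories (assumed)

def diffusion_step(room_mask, logits, rng):
    """One reverse step of a categorical diffusion sketch: sample a category
    per map cell from model logits, then force cells outside the room mask
    to the 'empty' class, i.e. explicit room-mask conditioning.
    (A real model would compute `logits` from the noisy map x_t.)"""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    h, w, _ = probs.shape
    flat = probs.reshape(-1, NUM_CLASSES)
    samples = np.array([rng.choice(NUM_CLASSES, p=p) for p in flat])
    sem_map = samples.reshape(h, w)
    sem_map[room_mask == 0] = 0  # architectural constraint: empty outside room
    return sem_map

def predict_attributes(sem_map):
    """Stage-2 stand-in: for each furniture category present in the map,
    return a coarse placement (the centroid of its cells). The paper instead
    uses a cross-attention network to predict full object attributes."""
    attrs = {}
    for c in range(1, NUM_CLASSES):
        ys, xs = np.nonzero(sem_map == c)
        if len(ys):
            attrs[c] = (float(ys.mean()), float(xs.mean()))
    return attrs
```

A minimal usage run: build an 8x8 room mask, draw random logits, take one step, and read off placements; every cell outside the mask ends up empty by construction.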