CasLayout: Cascaded 3D Layout Diffusion for Indoor Scene Synthesis with Implicit Relation Modeling

Synthesizing realistic 3D indoor scenes remains challenging due to data scarcity and the difficulty of simultaneously enforcing global architectural constraints and local semantic consistency. Existing approaches often overlook structural boundaries or rely on fully connected relation graphs that introduce redundant generation errors. Inspired by human design cognition, we present CasLayout, a cascaded diffusion framework that decomposes the joint scene generation task into four conditional sub-stages with explicit physical and semantic roles: (1) predicting furniture quantity and categories, (2) refining object sizes and feature embeddings, (3) modeling spatial relationships in a latent space, and (4) generating Oriented Bounding Boxes (OBBs). This decoupled architecture reduces data requirements and enables flexible integration of Large Language Models (LLMs) and Vision Language Models (VLMs) for zero-shot tasks such as image-to-scene generation. To maintain physical validity within complex floor plans, we explicitly model building elements (e.g., walls, doors, and windows) as conditional constraints. Furthermore, to address the high entropy of dense relation graphs, we introduce a sparse relation graph formulation aligned with human spatial descriptions. By encoding these sparse graphs into a compact latent space using a bidirectional Variational Autoencoder (VAE), the proposed framework provides enhanced relational controllability, allowing generated layouts to better respect functional organization. Experiments demonstrate that CasLayout achieves state-of-the-art performance in fidelity and diversity while enabling improved controllability in practical applications.
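To make the contrast between dense and sparse relation graphs concrete, the following is a minimal sketch rather than the authors' implementation. The `OBB` fields, the helper names `dense_edges` and `sparse_edges`, and the nearest-neighbour rule used to sparsify the graph are all illustrative assumptions; the abstract does not specify how the sparse graph is built.

```python
import math
from dataclasses import dataclass
from itertools import permutations


@dataclass
class OBB:
    """Hypothetical oriented bounding box: centre, size, and yaw about the vertical axis."""
    category: str
    cx: float
    cy: float
    sx: float
    sy: float
    yaw: float = 0.0   # radians, rotation in the floor plane
    cz: float = 0.0    # vertical centre (unused by the floor-plane helpers below)
    sz: float = 1.0    # height

    def footprint(self):
        """Return the four floor-plane corners (x, y) in counter-clockwise order."""
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        hx, hy = self.sx / 2, self.sy / 2
        local = ((hx, hy), (-hx, hy), (-hx, -hy), (hx, -hy))
        return [(self.cx + dx * c - dy * s, self.cy + dx * s + dy * c)
                for dx, dy in local]


def dense_edges(n):
    """Fully connected directed relation graph: n * (n - 1) ordered pairs."""
    return list(permutations(range(n), 2))


def sparse_edges(boxes):
    """Assumed sparsification rule: link each object only to its nearest neighbour.

    This loosely mimics descriptions such as "the nightstand is next to the bed";
    it is a stand-in, not the formulation used in the paper.
    """
    edges = []
    for i, a in enumerate(boxes):
        others = [j for j in range(len(boxes)) if j != i]
        if not others:
            continue
        nearest = min(others, key=lambda j: math.hypot(a.cx - boxes[j].cx,
                                                       a.cy - boxes[j].cy))
        edges.append((i, nearest))
    return edges


# Toy bedroom layout with made-up coordinates (metres).
scene = [
    OBB("bed", 0.0, 0.0, 2.0, 1.6),
    OBB("nightstand", 1.2, 0.0, 0.5, 0.5),
    OBB("wardrobe", 0.0, 3.0, 1.8, 0.6),
    OBB("desk", 4.0, 3.0, 1.2, 0.6),
]
print(len(dense_edges(len(scene))), "dense edges vs", len(sparse_edges(scene)), "sparse edges")
```

Under this assumed rule the number of relations grows linearly with the number of objects instead of quadratically, which is the kind of reduction a sparse formulation aims for before the graph is encoded into a latent space.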
