In text-driven 3D scene generation, object layout serves as a crucial intermediate representation that bridges high-level language instructions and detailed geometric output. It provides a structural blueprint for ensuring physical plausibility and also supports semantic controllability and interactive editing. However, current 3D indoor layout generation models are constrained by the limited scale, diversity, and annotation quality of existing datasets. To address this, we introduce M3DLayout, a large-scale, multi-source dataset for 3D indoor layout generation. M3DLayout comprises 21,367 layouts and over 433k object instances, integrating three distinct sources: real-world scans, professional CAD designs, and procedurally generated scenes. Each layout is paired with detailed structured text that covers a global scene summary, the relational placement of large furniture, and the fine-grained arrangement of smaller items. This diverse and richly annotated resource enables models to learn complex spatial and semantic patterns across a wide variety of indoor environments. To assess the potential of M3DLayout, we establish a benchmark using both a text-conditioned diffusion model and a text-conditioned autoregressive model. Experimental results demonstrate that our dataset provides a solid foundation for training layout generation models. Its multi-source composition enhances diversity, notably through the Inf3DLayout subset, which provides rich small-object information and enables the generation of more complex and detailed scenes. We hope that M3DLayout can serve as a valuable resource for advancing research in text-driven 3D scene synthesis. The dataset and code will be made public upon acceptance.