Generating high-fidelity 3D indoor scenes remains a significant challenge due to data scarcity and the complexity of modeling intricate spatial relations. Current methods often struggle to generalize beyond the training distribution to dense scenes, or they rely on LLMs/VLMs that lack precise spatial reasoning ability. Building on the observation that object placement depends mainly on local dependencies rather than on information-redundant global distributions, we propose Pair2Scene, a novel procedural generation framework that integrates learned local rules with scene hierarchies and physics-based algorithms. These rules mainly capture two types of inter-object relations: support relations that follow physical hierarchies and functional relations that reflect semantic links. We model these rules with a network that estimates the spatial position distributions of dependent objects conditioned on the position and geometry of their anchor objects. To train this model, we curate a dataset, 3D-Pairs, from existing scene data. During inference, our framework generates scenes by recursively applying the model within a hierarchical structure, using collision-aware rejection sampling to align local rules into coherent global layouts. Extensive experiments demonstrate that our framework outperforms existing methods in generating complex environments that go beyond the training data while maintaining physical and semantic plausibility.
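The inference procedure described above (recursive application of a pairwise model inside a hierarchy, with collision-aware rejection sampling) can be illustrated with a toy sketch. This is not the Pair2Scene implementation: the names `Box`, `sample_near`, `populate`, and `generate_scene` are hypothetical, the learned network is replaced by a simple Gaussian offset around the anchor, objects are reduced to axis-aligned 2D footprints, and only same-floor (functional) relations are handled, so support relations and room bounds are left out.

```python
import random
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned 2D footprint: centre (x, y), full width w and depth d."""
    name: str
    x: float
    y: float
    w: float
    d: float


def overlaps(a: Box, b: Box) -> bool:
    """True if the two footprints intersect with non-zero area."""
    return abs(a.x - b.x) * 2 < a.w + b.w and abs(a.y - b.y) * 2 < a.d + b.d


def sample_near(anchor: Box, rng: random.Random) -> tuple:
    """Hypothetical stand-in for the learned pairwise model.

    Assumption: a Gaussian offset scaled by the anchor's size replaces the
    network that predicts a position distribution for the dependent object.
    """
    return anchor.x + rng.gauss(0.0, anchor.w), anchor.y + rng.gauss(0.0, anchor.d)


def populate(anchor: Box, rules: dict, placed: list, rng: random.Random,
             max_tries: int = 500) -> list:
    """Recursively place the dependents of `anchor`, rejecting collisions.

    `rules` maps an anchor name to a list of (child_name, w, d) tuples and
    must be acyclic, otherwise the recursion would not terminate.
    """
    for name, w, d in rules.get(anchor.name, []):
        for _ in range(max_tries):
            x, y = sample_near(anchor, rng)
            candidate = Box(name, x, y, w, d)
            # Rejection step: keep the sample only if it hits nothing placed so far.
            if not any(overlaps(candidate, other) for other in placed):
                placed.append(candidate)
                # The accepted object becomes the anchor for its own dependents.
                populate(candidate, rules, placed, rng, max_tries)
                break
        # If every try collides, the object is skipped in this sketch.
    return placed


def generate_scene(seed: int = 0) -> list:
    """Build a small example layout from illustrative, made-up rules."""
    rng = random.Random(seed)
    rules = {
        "table": [("chair", 0.5, 0.5), ("chair", 0.5, 0.5)],
        "chair": [("rug", 0.4, 0.4)],
    }
    root = Box("table", 0.0, 0.0, 1.6, 0.9)
    return populate(root, rules, [root], rng)


if __name__ == "__main__":
    for box in generate_scene():
        print(f"{box.name:5s} at ({box.x:+.2f}, {box.y:+.2f})")
```

Because each candidate is checked against every object already accepted, the resulting layout is collision-free by construction, which is the property the rejection step is meant to provide. A real system would presumably also handle orientation, 3D geometry, and support surfaces.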