Indoor scene generation aims to create shape-compatible, style-consistent furniture arrangements within a spatially reasonable layout. However, most existing approaches focus on generating plausible furniture layouts and do not model the details of individual furniture pieces. To address this limitation, we propose a two-stage model that integrates shape priors into indoor scene generation by encoding furniture as anchor-latent representations. In the first stage, we employ discrete vector quantization to encode furniture pieces as anchor latents. Based on this representation, the shape and location of each furniture piece are characterized by a concatenation of its location, size, orientation, class, and anchor latent. In the second stage, we leverage a transformer model to predict indoor scenes autoregressively. Thanks to the proposed anchor-latent representation, our generative model produces shape-compatible and style-consistent furniture arrangements and synthesizes furniture in diverse shapes. Furthermore, our method facilitates various human interaction applications, such as style-consistent scene completion, object mismatch correction, and controllable object-level editing. Experimental results on the 3D-FRONT dataset demonstrate that our approach generates more consistent and compatible indoor scenes than existing methods, even without shape retrieval. Additionally, extensive ablation studies confirm the effectiveness of our design choices.
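The per-object representation described in the abstract (a concatenation of location, size, orientation, class, and anchor latent, with the latent obtained by discrete vector quantization) can be illustrated with a minimal sketch. This is not the authors' implementation. The class list, the cos/sin orientation encoding, the two-dimensional codebook, the use of a single code per object, and the names `quantize`, `Furniture`, and `object_token` are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical furniture classes; the real label set comes from the dataset.
CLASSES = ["bed", "chair", "table"]


def quantize(feature: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry nearest to `feature` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))


@dataclass
class Furniture:
    location: Sequence[float]       # (x, y, z) centre
    size: Sequence[float]           # (w, h, d) extents
    angle: float                    # rotation about the vertical axis, radians
    label: str                      # class name, must appear in CLASSES
    shape_feature: Sequence[float]  # continuous shape embedding (assumed given)


def object_token(obj: Furniture, codebook) -> List[float]:
    """Concatenate location, size, orientation, class, and anchor latent."""
    one_hot = [1.0 if c == obj.label else 0.0 for c in CLASSES]
    # Assumption: the anchor latent is the single nearest codebook vector.
    anchor = list(codebook[quantize(obj.shape_feature, codebook)])
    # Assumption: orientation is encoded as (cos, sin) of the angle.
    orientation = [math.cos(obj.angle), math.sin(obj.angle)]
    return list(obj.location) + list(obj.size) + orientation + one_hot + anchor


# Toy codebook with three 2-D entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
chair = Furniture((1.0, 0.0, 2.0), (0.5, 0.9, 0.5), 0.0, "chair", (0.9, 1.2))
token = object_token(chair, codebook)
```

In a second stage, a sequence of such tokens, one per furniture piece, would be what an autoregressive transformer predicts. That model is outside the scope of this sketch.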