Synthesizing realistic and diverse indoor 3D scene layouts in a controllable fashion opens up applications in simulated navigation and virtual reality. As concise and robust representations of a scene, scene graphs have proven to be well suited as a semantic control over the generated layout. We present a variant of the conditional variational autoencoder (cVAE) model to synthesize 3D scenes from scene graphs and floor plans. We exploit the properties of self-attention layers to capture high-level relationships between objects in a scene, and use these layers as the building blocks of our model. Our model leverages graph transformers to estimate the size, dimension and orientation of the objects in a room while satisfying the relationships in the given scene graph. Our experiments show that self-attention layers lead to sparser (7.9x compared to Graphto3D) and more diverse (16%) scenes.
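To make the role of self-attention concrete, the following is a minimal sketch of a single-head scaled dot-product self-attention layer applied to a set of per-object embeddings. It is not the architecture described above: the function name `self_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the random inputs, and all dimensions are assumptions chosen for illustration, and the sketch ignores scene-graph edges, floor plans, and the cVAE itself.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative only).

    x: (num_objects, d_model) array, one row per object in the scene.
    w_q, w_k, w_v: (d_model, d_head) projection matrices (hypothetical names).
    Returns the attended features and the attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Pairwise compatibility between every object and every other object.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each object's output is a weighted mix of all objects' value vectors.
    return weights @ v, weights


# Assumed toy setup: 5 objects with random 16-dimensional embeddings.
rng = np.random.default_rng(0)
num_objects, d_model, d_head = 5, 16, 8
x = rng.normal(size=(num_objects, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

Each row of `attn` sums to one and indicates how strongly one object attends to every other object, which is the mechanism that lets such a layer model relationships across the whole scene rather than only between neighbouring nodes.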