Compositional 3D scene synthesis has diverse applications across industries such as robotics, film, and video games, as it closely mirrors the complexity of real-world multi-object environments. Early works typically employ shape-retrieval-based frameworks, which inherently suffer from limited shape diversity. Recent progress in shape generation with powerful generative models, such as diffusion models, has improved shape fidelity. However, these approaches treat 3D shape generation and layout generation separately, and the synthesized scenes are often hampered by layout collisions, indicating that scene-level fidelity remains under-explored. In this paper, we aim to generate realistic and plausible 3D scenes from scene graphs. To enrich the representational capability of the given scene graph inputs, a large language model is utilized to explicitly aggregate global graph features with local relationship features. With a unified graph convolutional network (GCN), graph features are extracted from scene graphs updated via a joint layout-shape distribution. During scene generation, an IoU-based regularization loss is introduced to constrain the predicted 3D layouts. Benchmarked on the SG-FRONT dataset, our method achieves better 3D scene synthesis, especially in terms of scene-level fidelity. The source code will be released after publication.
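The abstract does not specify the exact form of the IoU-based regularization; a minimal sketch, assuming it penalizes pairwise overlap between the predicted 3D bounding boxes $b_i$ of the $N$ objects in a scene, is:

\[
\mathcal{L}_{\mathrm{IoU}} = \frac{2}{N(N-1)} \sum_{i<j} \mathrm{IoU}(b_i, b_j),
\qquad
\mathrm{IoU}(b_i, b_j) = \frac{|b_i \cap b_j|}{|b_i \cup b_j|},
\]

where $|\cdot|$ denotes box volume. Minimizing this term drives pairwise box overlaps toward zero, discouraging the layout collisions described above; the averaging and weighting conventions here are assumptions for illustration.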