Learning-based methods have become increasingly popular in 3D indoor scene synthesis (ISS), showing superior performance over traditional optimization-based approaches. These learning-based methods typically model distributions over simple yet explicit scene representations using generative models. However, because these oversimplified explicit representations overlook detailed information and lack guidance from the multimodal relationships within a scene, most learning-based methods struggle to generate indoor scenes with realistic object arrangements and styles. In this paper, we introduce a new method, the Scene Implicit Neural Field (S-INF), for indoor scene synthesis, which learns meaningful representations of multimodal relationships to enhance the realism of synthesized indoor scenes. S-INF assumes that the scene layout is often correlated with detailed object information. It disentangles the multimodal relationships into scene layout relationships and detailed object relationships, and later fuses them through implicit neural fields (INFs). By learning specialized scene layout relationships and projecting them into S-INF, we achieve realistic generation of scene layouts. Additionally, S-INF captures dense and detailed object relationships through differentiable rendering, ensuring stylistic consistency across objects. Extensive experiments on the benchmark 3D-FRONT dataset demonstrate that our method consistently achieves state-of-the-art performance across different types of ISS tasks.