Realistic conditional 3D scene synthesis enhances and accelerates the creation of virtual environments, and it can supply extensive training data for computer vision and robotics research, among other applications. Diffusion models have shown strong performance in related applications, e.g., precisely arranging unordered sets. However, these models have not been fully explored for floor-conditioned scene synthesis. We present MiDiffusion, a novel mixed discrete-continuous diffusion model architecture designed to synthesize plausible 3D indoor scenes from given room types, floor plans, and potentially pre-existing objects. We represent a scene layout by a 2D floor plan and a set of objects, each defined by its category, location, size, and orientation. Our approach implements structured corruption across the mixed discrete semantic and continuous geometric domains, resulting in a better-conditioned problem for the reverse denoising step. We evaluate our approach on the 3D-FRONT dataset. Our experimental results demonstrate that MiDiffusion substantially outperforms state-of-the-art autoregressive and diffusion models in floor-conditioned 3D scene synthesis. In addition, our models can handle partial object constraints via a corruption-and-masking strategy without task-specific training. We show that MiDiffusion maintains clear advantages over existing approaches in scene completion and furniture arrangement experiments.
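To make the mixed-domain corruption and the corruption-and-masking idea concrete, the following is a minimal sketch and not the MiDiffusion implementation. It assumes a linear noise schedule, an absorbing `MASK` token for corrupted categories, and Gaussian noise on the geometric attributes; the names `SceneObject`, `alpha_bar`, `corrupt`, and the `fixed` flag are hypothetical and chosen only for illustration.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional

MASK = -1  # hypothetical absorbing token for a corrupted category (assumption)


@dataclass
class SceneObject:
    category: int          # discrete semantic attribute
    geometry: List[float]  # continuous attributes: location, size, orientation


def alpha_bar(t: int, num_steps: int) -> float:
    """Assumed linear signal level: 1.0 at t=0, 0.0 at t=num_steps."""
    return 1.0 - t / num_steps


def corrupt(
    objects: List[SceneObject],
    t: int,
    num_steps: int,
    rng: random.Random,
    fixed: Optional[List[bool]] = None,
) -> List[SceneObject]:
    """Corrupt a set of objects to noise level t.

    Categories are replaced by MASK with probability 1 - alpha_bar (discrete
    domain); geometry receives Gaussian noise scaled by the same schedule
    (continuous domain). Objects flagged in `fixed` are copied unchanged,
    which stands in for known, pre-existing objects.
    """
    if fixed is None:
        fixed = [False] * len(objects)
    a = alpha_bar(t, num_steps)
    corrupted = []
    for obj, keep in zip(objects, fixed):
        if keep:
            # Known object: skip corruption so it can act as a constraint.
            corrupted.append(SceneObject(obj.category, list(obj.geometry)))
            continue
        category = obj.category if rng.random() < a else MASK
        geometry = [
            math.sqrt(a) * g + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for g in obj.geometry
        ]
        corrupted.append(SceneObject(category, geometry))
    return corrupted
```

Under these assumptions, calling `corrupt(scene, t, num_steps, random.Random(0), fixed=[True, False])` leaves the first object intact while noising the second, which is one way partial object constraints could be imposed at sampling time without retraining.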