This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval techniques, or recent approaches that employ multi-stage object-by-object generation, MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models, enabling the simultaneous generation of multiple 3D instances with accurate spatial relationships and high generalizability. At its core, MIDI incorporates a novel multi-instance attention mechanism that captures inter-object interactions and spatial coherence directly within the generation process, without the need for complex multi-step pipelines. The method takes partial object images and global scene context as inputs, directly modeling object completion during 3D generation. During training, we supervise the interactions between 3D instances using a limited amount of scene-level data, while incorporating single-object data for regularization, thereby preserving the pre-trained models' generalization ability. MIDI demonstrates state-of-the-art performance in image-to-scene generation, validated through evaluations on synthetic data, real-world scene data, and stylized scene images generated by text-to-image diffusion models.
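To make the core idea concrete, the following is a minimal sketch of how cross-instance attention might be realized, assuming each 3D instance is represented by a sequence of latent tokens inside a shared denoising network. The class name, tensor shapes, and layer choices are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MultiInstanceAttention(nn.Module):
    """Hypothetical sketch: joint self-attention across all instances'
    latent tokens, so each object can condition on the others' layout."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_instances, seq_len, dim) — per-instance latents.
        b, n, s, d = tokens.shape
        # Flatten all instances into one long sequence so every token can
        # attend to tokens of every other instance, capturing inter-object
        # interactions and spatial coherence in a single attention pass.
        x = tokens.reshape(b, n * s, d)
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        # Residual connection, then restore the per-instance layout.
        return (x + out).reshape(b, n, s, d)

# Usage: 4 instances, 256 latent tokens each, 512-d features.
if __name__ == "__main__":
    layer = MultiInstanceAttention(dim=512)
    latents = torch.randn(2, 4, 256, 512)
    print(layer(latents).shape)  # torch.Size([2, 4, 256, 512])
```

The key design choice this sketch illustrates is that, rather than generating objects one by one and composing them afterwards, all instances are denoised simultaneously while exchanging information through shared attention, which is how spatial relationships can be modeled directly within the generation process.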