Precise spatial fidelity in Image-to-3D (I23D) multi-instance generation is critical for downstream real-world applications. Recent work attempts to address this by fine-tuning pre-trained I23D models on multi-instance datasets, which incurs substantial training overhead and struggles to guarantee spatial fidelity. In fact, we observe that pre-trained I23D models already possess meaningful spatial priors, which remain underutilized, as evidenced by instance entanglement issues. Motivated by this, we propose TIMI, a novel Training-free framework for Image-to-3D Multi-Instance generation that achieves high spatial fidelity. Specifically, we first introduce an Instance-aware Separation Guidance (ISG) module, which facilitates instance disentanglement during the early denoising stage. Next, to stabilize the guidance introduced by ISG, we devise a Spatial-stabilized Geometry-adaptive Update (SGU) module that preserves the geometric characteristics of instances while maintaining their relative relationships. Extensive experiments demonstrate that our method outperforms existing multi-instance methods in both global layout and the distinctness of local instances, without requiring additional training and with faster inference.