In this paper, we present a novel two-stage approach that fully exploits the information in the reference image to establish a customized knowledge prior for image-to-3D generation. Previous approaches rely primarily on a general diffusion prior, which struggles to yield results consistent with the reference image; we instead propose a subject-specific, multi-modal diffusion model. This model aids NeRF optimization by taking the shading mode into account for improved geometry, and it enhances the texture of the coarse results to achieve superior refinement. Both aspects contribute to faithfully aligning the 3D content with the subject. Extensive experiments show that our method, Customize-It-3D, outperforms previous works by a substantial margin. It produces faithful 360-degree reconstructions with impressive visual quality, making it well suited to various applications, including text-to-3D creation.