Recent progress in Text-to-3D generation has been driven by distilling knowledge from powerful large text-to-image diffusion models (LDMs). Nonetheless, existing Text-to-3D approaches often suffer from over-saturation, inadequate detail, and unrealistic outputs. This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues. Our approach uses image-to-image pipelines, powered by LDMs, to generate posed, high-quality images from the renderings of coarse 3D models. Although the generated images mostly alleviate the aforementioned issues, view inconsistency and significant content variance persist because of the inherently generative nature of large diffusion models, which makes it difficult to use these images effectively. To overcome this hurdle, we propose integrating a discriminator together with a novel Diffusion-GAN dual training strategy to guide the training of 3D models. For the discriminator, the synthesized multi-view images are treated as real data, while the renderings of the 3D models being optimized serve as fake data. We conduct a comprehensive set of experiments that demonstrate the effectiveness of our method compared with baseline approaches.
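The real/fake assignment described in the abstract can be illustrated with a minimal sketch. The abstract does not state which adversarial objective is used, so the non-saturating logistic GAN loss below is an assumption, as are the function names `discriminator_loss` and `adversarial_loss_for_3d_model` and the use of plain lists of discriminator logits in place of image batches. The diffusion-based half of the dual training strategy is not shown.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def discriminator_loss(logits_synthesized, logits_rendered):
    """Logistic discriminator loss (assumed form, not taken from the paper).

    logits_synthesized: discriminator logits for the synthesized
        multi-view images, which play the role of REAL data.
    logits_rendered: discriminator logits for renderings of the 3D model
        being optimized, which play the role of FAKE data.
    """
    # -log(sigmoid(l)) == softplus(-l): push real logits up.
    real_term = mean(softplus(-l) for l in logits_synthesized)
    # -log(1 - sigmoid(l)) == softplus(l): push fake logits down.
    fake_term = mean(softplus(l) for l in logits_rendered)
    return real_term + fake_term


def adversarial_loss_for_3d_model(logits_rendered):
    """Non-saturating loss for the 3D model (assumed form).

    The 3D model is updated so that its renderings are scored as real.
    """
    return mean(softplus(-l) for l in logits_rendered)


if __name__ == "__main__":
    # Hypothetical logits, for illustration only.
    synthesized = [2.0, 1.5, 0.5]
    rendered = [-1.0, 0.2, -0.7]
    print("discriminator loss:", discriminator_loss(synthesized, rendered))
    print("3D-model adversarial loss:", adversarial_loss_for_3d_model(rendered))
```

In a full pipeline, the logits would come from a discriminator network applied to image batches, and the adversarial term would be combined with a diffusion-based guidance term; the weighting between the two is not specified in the abstract.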