Text-to-3D content creation has recently made remarkable progress by leveraging 2D and 3D diffusion models. While 3D diffusion models offer strong multi-view consistency, their ability to generate high-quality and diverse 3D assets is limited by the scarcity of 3D data. In contrast, distillation from 2D diffusion models achieves excellent generalization and rich detail without any 3D data. However, such 2D lifting methods suffer from an inherent view-agnostic ambiguity, which leads to serious multi-face Janus issues because text prompts alone fail to provide sufficient guidance for learning coherent 3D results. Instead of retraining a costly viewpoint-aware model, we study how to fully exploit easily accessible coarse 3D knowledge to enhance the prompts and guide 2D lifting optimization for refinement. In this paper, we propose Sherpa3D, a new text-to-3D framework that simultaneously achieves high fidelity, generalizability, and geometric consistency. Specifically, we design a pair of guiding strategies derived from the coarse 3D prior generated by the 3D diffusion model: a structural guidance for geometric fidelity and a semantic guidance for 3D coherence. With these two types of guidance, the 2D diffusion model enriches the 3D content with diverse, high-quality results. Extensive experiments show that Sherpa3D outperforms state-of-the-art text-to-3D methods in terms of quality and 3D consistency.
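The abstract does not say how the structural and semantic guidance are computed, so the following is only an illustrative sketch of how two auxiliary terms derived from a coarse 3D prior might be added to a 2D lifting objective. Everything in it is an assumption made for illustration: the function names, the use of normal-map gradients for the structural term, the use of cosine similarity between feature vectors for the semantic term, and the weights `w_struct` and `w_sem`.

```python
import numpy as np

def structural_loss(normal_cur, normal_coarse):
    """Hypothetical structural term: mean squared difference between the
    spatial gradients of two normal maps of shape (H, W, 3)."""
    gy_c, gx_c = np.gradient(normal_cur, axis=(0, 1))
    gy_p, gx_p = np.gradient(normal_coarse, axis=(0, 1))
    return float(np.mean((gx_c - gx_p) ** 2 + (gy_c - gy_p) ** 2))

def semantic_loss(feat_cur, feat_coarse, eps=1e-8):
    """Hypothetical semantic term: one minus the cosine similarity between
    feature vectors of the current render and the coarse-prior render."""
    denom = np.linalg.norm(feat_cur) * np.linalg.norm(feat_coarse) + eps
    return float(1.0 - np.dot(feat_cur, feat_coarse) / denom)

def total_loss(lifting_loss, normal_cur, normal_coarse,
               feat_cur, feat_coarse, w_struct=1.0, w_sem=1.0):
    """Assumed combination: the 2D lifting loss plus the two weighted terms."""
    return (lifting_loss
            + w_struct * structural_loss(normal_cur, normal_coarse)
            + w_sem * semantic_loss(feat_cur, feat_coarse))
```

In this sketch, a render that matches the coarse prior in both normal-map structure and features adds almost nothing to the lifting loss, while deviations in either are penalized according to the chosen weights.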