While image diffusion models have made significant strides in text-driven 3D content creation, they often fail to capture the intended meaning of the text prompt accurately, particularly its direction information. This shortcoming gives rise to the Janus problem, in which 3D models generated under the guidance of such diffusion models end up with multiple faces. In this paper, we present a robust pipeline for generating high-fidelity 3D content with orthogonal-view image guidance. Specifically, we introduce a novel 2D diffusion model that generates an image consisting of four orthogonal-view sub-images for a given text prompt. The 3D content is then created under the guidance of this diffusion model, which enhances 3D consistency and provides strong structured semantic priors. This addresses the Janus problem and significantly improves generation efficiency. Additionally, we employ a progressive 3D synthesis strategy that substantially improves the quality of the created 3D content. Both quantitative and qualitative evaluations show that our method significantly improves over previous text-to-3D techniques.
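As a rough illustration of the four orthogonal-view image described above, the sketch below splits one composite image into four sub-images and pairs each with a camera azimuth. The 2x2 grid layout, the row-major view order, the azimuths 0, 90, 180 and 270 degrees, and the names `split_orthogonal_views` and `camera_position` are all assumptions made for illustration; the abstract does not specify how the sub-images are arranged.

```python
import math

# Assumed view order (not specified in the abstract): row-major over a 2x2 grid.
VIEW_AZIMUTHS_DEG = (0, 90, 180, 270)


def split_orthogonal_views(composite):
    """Split a composite image (a list of pixel rows) into four quadrants.

    Returns a dict that maps each assumed azimuth in degrees to its sub-image.
    """
    height = len(composite)
    width = len(composite[0]) if height else 0
    if height == 0 or height % 2 or width % 2:
        raise ValueError("composite must be non-empty with even height and width")
    hh, hw = height // 2, width // 2
    tiles = [
        [row[:hw] for row in composite[:hh]],  # top-left
        [row[hw:] for row in composite[:hh]],  # top-right
        [row[:hw] for row in composite[hh:]],  # bottom-left
        [row[hw:] for row in composite[hh:]],  # bottom-right
    ]
    return dict(zip(VIEW_AZIMUTHS_DEG, tiles))


def camera_position(azimuth_deg, radius=1.0):
    """Camera position on a horizontal circle around the origin (y axis up)."""
    a = math.radians(azimuth_deg)
    return (radius * math.sin(a), 0.0, radius * math.cos(a))
```

Under these assumptions, neighbouring views sit 90 degrees apart around the object, so each sub-image can be matched to a known camera direction when it is used as guidance.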