Despite remarkable recent advances in generative modeling, efficiently generating high-quality 3D assets from textual prompts remains difficult. A key challenge is data scarcity: the largest 3D datasets contain only millions of assets, while their 2D counterparts contain billions of text-image pairs. To address this, we propose a novel approach that harnesses large, pretrained 2D diffusion models. Specifically, our approach, HexaGen3D, fine-tunes a pretrained text-to-image model to jointly predict 6 orthographic projections and the corresponding latent triplane. We then decode these latents to generate a textured mesh. HexaGen3D does not require per-sample optimization and can infer high-quality, diverse objects from textual prompts in 7 seconds, offering significantly better quality-to-latency trade-offs than existing approaches. Furthermore, HexaGen3D demonstrates strong generalization to new objects or compositions.
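For readers unfamiliar with triplane representations, the following minimal sketch illustrates how a triplane is commonly queried at a 3D point. It is not HexaGen3D's implementation: the function names `bilinear_sample` and `query_triplane`, the plane keys `"xy"`, `"xz"`, `"yz"`, the coordinate range [-1, 1], and aggregation by summation are all assumptions made for illustration, and the abstract does not specify any of them.

```python
from typing import Dict, List, Sequence

# A feature plane is a nested list indexed as plane[row][col][channel].
# Assumption: every plane has at least 2 rows and 2 columns.
Plane = List[List[List[float]]]


def bilinear_sample(plane: Plane, u: float, v: float) -> List[float]:
    """Bilinearly interpolate a feature plane at (u, v), both in [-1, 1]."""
    height, width = len(plane), len(plane[0])
    # Clamp to the valid range, then map [-1, 1] to continuous pixel coordinates.
    u = min(max(u, -1.0), 1.0)
    v = min(max(v, -1.0), 1.0)
    x = (u + 1.0) * 0.5 * (width - 1)
    y = (v + 1.0) * 0.5 * (height - 1)
    # Top-left corner of the enclosing cell, kept inside the grid.
    x0 = min(int(x), width - 2)
    y0 = min(int(y), height - 2)
    tx, ty = x - x0, y - y0
    channels = len(plane[0][0])
    return [
        (1 - tx) * (1 - ty) * plane[y0][x0][c]
        + tx * (1 - ty) * plane[y0][x0 + 1][c]
        + (1 - tx) * ty * plane[y0 + 1][x0][c]
        + tx * ty * plane[y0 + 1][x0 + 1][c]
        for c in range(channels)
    ]


def query_triplane(planes: Dict[str, Plane], point: Sequence[float]) -> List[float]:
    """Project a 3D point onto three axis-aligned planes and sum the features.

    Summation is an assumed aggregation rule; concatenation is another
    common choice.
    """
    x, y, z = point
    f_xy = bilinear_sample(planes["xy"], x, y)
    f_xz = bilinear_sample(planes["xz"], x, z)
    f_yz = bilinear_sample(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


# Toy 2x2 single-channel planes (hypothetical values, for illustration only).
toy_planes = {
    "xy": [[[0.0], [1.0]], [[0.0], [1.0]]],  # feature grows along x
    "xz": [[[2.0], [2.0]], [[2.0], [2.0]]],  # constant feature
    "yz": [[[0.0], [0.0]], [[0.0], [0.0]]],  # zero feature
}
print(query_triplane(toy_planes, (0.0, 0.0, 0.0)))  # [2.5]
```

In a full pipeline, the aggregated feature vector would typically be passed to a small decoder network that predicts geometry and color at the queried point; that decoder is omitted here.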