3D asset generation is receiving massive amounts of attention, inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data, which often results in non-photorealistic 3D objects without backgrounds. In this paper, we present a method that leverages pretrained text-to-image models as a prior and learns to generate multi-view images in a single denoising process from real-world data. Concretely, we propose to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net network of the text-to-image model. Moreover, we design an autoregressive generation scheme that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its capability to generate instances with a variety of high-quality shapes and textures in authentic surroundings. Compared to existing methods, the results generated by our method are consistent and have favorable visual quality (-30% FID, -37% KID).
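To make the architectural idea concrete, below is a minimal, hypothetical PyTorch sketch of how a 2D U-Net block could be augmented with a cross-frame-attention layer (so all views of one object exchange information) and an added 3D-aware layer. It is not the authors' implementation: the class names, tensor layout (B objects x N views), and the 1x1 convolution standing in for the actual volume-rendering layer are illustrative assumptions only.

```python
# Hypothetical sketch, not the paper's code: a pretrained 2D U-Net block wrapped
# with (a) cross-frame attention across the N views of one object and (b) a
# placeholder 3D-aware projection that stands in for the volume-rendering layer.
import torch
import torch.nn as nn


class CrossFrameAttention(nn.Module):
    """Attention over tokens gathered from all frames of one object."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C, H, W) -- B objects, N views each
        b, n, c, h, w = x.shape
        tokens = x.permute(0, 1, 3, 4, 2).reshape(b, n * h * w, c)
        normed = self.norm(tokens)
        # Every spatial token can attend to tokens from all other views.
        tokens = tokens + self.attn(normed, normed, normed, need_weights=False)[0]
        return tokens.reshape(b, n, h, w, c).permute(0, 1, 4, 2, 3)


class MultiViewUNetBlock(nn.Module):
    """Wraps an existing channel-preserving 2D block with the two added layers."""

    def __init__(self, base_block: nn.Module, dim: int):
        super().__init__()
        self.base_block = base_block                 # pretrained 2D layers
        self.cross_frame = CrossFrameAttention(dim)  # added: multi-view attention
        self.volume_proj = nn.Conv2d(dim, dim, 1)    # added: stand-in for volume rendering

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, c, h, w = x.shape
        # Run the original 2D block independently on every view.
        x = self.base_block(x.reshape(b * n, c, h, w)).reshape(b, n, c, h, w)
        # Share information across views of the same object.
        x = self.cross_frame(x)
        # Placeholder for the 3D volume-rendering component.
        x = self.volume_proj(x.reshape(b * n, c, h, w)).reshape(b, n, c, h, w)
        return x


# Usage example with made-up sizes: 2 objects, 5 views each, 64 channels.
block = MultiViewUNetBlock(nn.Conv2d(64, 64, 3, padding=1), dim=64)
out = block(torch.randn(2, 5, 64, 32, 32))
print(out.shape)  # torch.Size([2, 5, 64, 32, 32])
```

The sketch only illustrates where the added layers sit relative to the pretrained 2D layers inside one block; the real method's volume-rendering layer would additionally consume camera poses to aggregate features in 3D, which is omitted here.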