Recently, text-to-image generation has exhibited remarkable advancements, with the ability to produce visually impressive results. In contrast, text-to-3D generation has not yet reached a comparable level of quality. Existing methods primarily rely on text-guided score distillation sampling (SDS), and they encounter difficulties in transferring 2D attributes of the generated images to 3D content. In this work, we aim to develop an effective 3D generative model capable of synthesizing high-resolution textured meshes by leveraging both textual and image information. To this end, we introduce Guide3D, a zero-shot text-and-image-guided generative model for 3D avatar generation based on diffusion models. Our model involves (1) generating sparse-view images of a text-consistent character using diffusion models, and (2) jointly optimizing multi-resolution differentiable marching tetrahedral grids with pixel-aligned image features. We further propose a similarity-aware feature fusion strategy for efficiently integrating features from different views. Moreover, we introduce two novel training objectives as an alternative to calculating SDS, significantly enhancing the optimization process. We thoroughly evaluate the performance and components of our framework, which outperforms the current state-of-the-art in producing topologically and structurally correct geometry and high-resolution textures. Guide3D enables the direct transfer of 2D-generated images to the 3D space. Our code will be made publicly available.
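The abstract names a similarity-aware feature fusion strategy for combining pixel-aligned features from several views but does not give its formula. The sketch below is one plausible reading, not the authors' implementation: each view's feature is weighted by a softmax over its cosine similarity to the cross-view mean, so views that disagree with the consensus contribute less. The function name `fuse_view_features`, the `temperature` parameter, and the choice of the mean as the reference are all assumptions made for illustration.

```python
import numpy as np


def fuse_view_features(view_feats, temperature=0.1):
    """Fuse per-view features with similarity-based weights (illustrative only).

    view_feats: array of shape (V, N, C), i.e. C-dimensional features for
                N query points sampled from each of V views.
    Returns an array of shape (N, C).
    """
    feats = np.asarray(view_feats, dtype=np.float64)

    # Assumed reference: the mean feature across views for each point.
    ref = feats.mean(axis=0, keepdims=True)  # (1, N, C)

    # Cosine similarity between each view's feature and the reference.
    eps = 1e-8
    dot = (feats * ref).sum(axis=-1)  # (V, N)
    norms = np.linalg.norm(feats, axis=-1) * np.linalg.norm(ref, axis=-1)
    sim = dot / (norms + eps)  # (V, N)

    # Numerically stable softmax over the view axis.
    logits = sim / temperature
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)  # (V, N)

    # Weighted sum of view features.
    return (weights[..., None] * feats).sum(axis=0)  # (N, C)


if __name__ == "__main__":
    # Two views agree on a point's feature; a third view disagrees.
    views = np.array([[[1.0, 0.0]], [[1.0, 0.0]], [[-1.0, 0.0]]])
    print(fuse_view_features(views))
```

A lower `temperature` makes the fusion behave more like picking the most consistent view, while a higher value approaches a plain average; which regime suits sparse-view avatar reconstruction is not something the abstract settles.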