Previous efforts have succeeded in generating production-ready 3D assets from text or images. However, these methods primarily employ NeRF or 3D Gaussian representations, which are ill-suited to producing the smooth, high-quality geometry required by modern rendering pipelines. In this paper, we propose LDM, a novel feed-forward framework that generates high-fidelity, illumination-decoupled textured meshes from a single image or a text prompt. We first use a multi-view diffusion model to generate sparse multi-view inputs from the single image or text prompt, and then train a transformer-based model to predict a tensorial SDF field from these sparse multi-view images. Finally, we employ a gradient-based mesh optimization layer to refine the model, enabling it to produce an SDF field from which high-quality textured meshes can be extracted. Extensive experiments demonstrate that our method generates diverse, high-quality 3D mesh assets with corresponding decomposed RGB textures within seconds.
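The pipeline described above has three stages: a multi-view diffusion model produces sparse views, a transformer predicts a tensorial SDF field, and a mesh is extracted from that field. The sketch below illustrates the shape of that data flow in PyTorch; all module names, layer sizes, and the triplane-style factorization of the SDF are assumptions made for illustration, not the authors' implementation, and the gradient-based mesh optimization layer is replaced here by plain marching-cubes extraction for brevity.

```python
import torch
import torch.nn as nn
from skimage import measure  # marching cubes for mesh extraction

class TensorialSDF(nn.Module):
    """Illustrative stand-in for the transformer that maps sparse multi-view
    image tokens to a factorized ("tensorial") SDF volume. Shapes are assumed."""
    def __init__(self, dim=256, res=64):
        super().__init__()
        self.res = res
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Three axis-aligned feature planes whose product approximates a
        # dense SDF grid (a triplane-like low-rank factorization).
        self.plane_heads = nn.ModuleList(nn.Linear(dim, res * res) for _ in range(3))

    def forward(self, view_tokens):
        # view_tokens: (B, n_views * n_patches, dim), e.g. from an image tokenizer
        h = self.encoder(view_tokens).mean(dim=1)  # pooled scene code, (B, dim)
        xy, yz, xz = (head(h).view(-1, self.res, self.res) for head in self.plane_heads)
        # Combine the planes into a dense (B, res, res, res) SDF volume.
        return torch.einsum('bxy,byz,bxz->bxyz', xy, yz, xz)

# Usage: run the feed-forward model on dummy tokens and extract a mesh.
model = TensorialSDF()
tokens = torch.randn(1, 4 * 196, 256)  # e.g. 4 generated views x 196 patches
with torch.no_grad():
    sdf = model(tokens)[0].numpy()
# With untrained weights the zero level set may be empty, so cross at the mean.
verts, faces, normals, _ = measure.marching_cubes(sdf, level=float(sdf.mean()))
```

In the actual method, the extraction step would be a differentiable, gradient-based mesh optimization layer so that mesh-level losses can refine the SDF predictor end to end; the non-differentiable marching cubes call above only stands in for that final stage.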