In this paper, we tackle the new and challenging problem of text-driven generation of 3D garments with high-quality textures. We propose "WordRobe", a novel framework for generating unposed, textured 3D garment meshes from user-friendly text prompts. We achieve this by first learning a latent representation of 3D garments using a novel coarse-to-fine training strategy and a latent disentanglement loss, which promotes better latent interpolation. Subsequently, we align the garment latent space with the CLIP embedding space in a weakly supervised manner, enabling text-driven 3D garment generation and editing. For appearance modeling, we leverage the zero-shot generation capability of ControlNet to synthesize view-consistent texture maps in a single feed-forward inference step, drastically reducing generation time compared with existing methods. We demonstrate superior performance over the current state-of-the-art in learning 3D garment latent spaces, garment interpolation, and text-driven texture synthesis, supported by quantitative evaluation and a qualitative user study. The unposed 3D garment meshes generated by WordRobe can be fed directly into standard cloth simulation and animation pipelines without any post-processing.