We present SCULPT, a novel 3D generative model for clothed and textured 3D meshes of humans. Specifically, we devise a deep neural network that learns to represent the geometry and appearance distribution of clothed human bodies. Training such a model is challenging, as datasets of textured 3D meshes of humans are limited in size and accessibility. Our key observation is that there exist medium-sized 3D scan datasets like CAPE, as well as large-scale 2D image datasets of clothed humans, and that multiple appearances can be mapped to a single geometry. To learn effectively from the two data modalities, we propose an unpaired learning procedure for pose-dependent clothed and textured human meshes. First, we learn a pose-dependent geometry space from 3D scan data, which we represent as per-vertex displacements with respect to the SMPL model. Next, we train a geometry-conditioned texture generator in an unsupervised way using the 2D image data, conditioning it on intermediate activations of the learned geometry model. To alleviate entanglement between pose and clothing type, and between pose and clothing appearance, we condition both the geometry and texture generators on attribute labels, such as clothing types for the geometry generator and clothing colors for the texture generator. We automatically generate these conditioning labels for the 2D images using the visual question answering model BLIP and CLIP. We validate our method on the SCULPT dataset and compare it to state-of-the-art 3D generative models for clothed human bodies. Our code and data can be found at https://sculpt.is.tue.mpg.de.
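To make the geometry representation and the label conditioning concrete, the following is a minimal sketch and not the SCULPT implementation. It assumes that clothed geometry is obtained by adding one 3D offset to each body vertex, and that an attribute label enters a generator as a one-hot vector. The function names, the label vocabulary, and the random arrays standing in for SMPL vertices and generator output are hypothetical; only the vertex count of the SMPL template (6890) is a property of the actual model.

```python
import numpy as np

NUM_SMPL_VERTICES = 6890  # vertex count of the SMPL template mesh


def apply_displacements(body_vertices, displacements):
    """Offset each body vertex by its own 3D displacement vector."""
    body_vertices = np.asarray(body_vertices, dtype=np.float64)
    displacements = np.asarray(displacements, dtype=np.float64)
    if body_vertices.shape != displacements.shape:
        raise ValueError("expected exactly one 3D displacement per vertex")
    return body_vertices + displacements


def one_hot(label, vocabulary):
    """Encode an attribute label as a one-hot conditioning vector."""
    encoding = np.zeros(len(vocabulary), dtype=np.float64)
    encoding[vocabulary.index(label)] = 1.0
    return encoding


# Hypothetical label vocabulary, for illustration only.
CLOTHING_TYPES = ["short sleeves", "long sleeves", "shorts", "long pants"]

rng = np.random.default_rng(0)
# Random stand-ins: a real pipeline would use SMPL vertices and the
# output of a trained geometry generator here.
body = rng.normal(size=(NUM_SMPL_VERTICES, 3))
offsets = 0.01 * rng.normal(size=(NUM_SMPL_VERTICES, 3))

clothed = apply_displacements(body, offsets)
condition = one_hot("long pants", CLOTHING_TYPES)
```

Because the displacements share the vertex indexing of the body mesh, the clothed mesh keeps the body's topology, so a single texture layout can be reused across generated geometries.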