We propose a 3D generation pipeline that uses diffusion models to generate realistic human digital avatars. Generating 3D human meshes is challenging because of the wide variety of human identities, poses, and stochastic details. To address this, we decompose the problem into 2D normal map generation and normal map-based 3D reconstruction. Specifically, we first use a pose-conditional diffusion model to simultaneously generate realistic normal maps for the front and back of a clothed human, which we call dual normal maps. For 3D reconstruction, we "carve" the prior SMPL-X mesh into a detailed 3D mesh through mesh optimization guided by the normal maps. To further enhance high-frequency details, we present a diffusion resampling scheme applied to both the body and facial regions, which encourages the generation of realistic digital avatars. We also seamlessly incorporate a recent text-to-image diffusion model to support text-based control of human identity. Our method, named Chupa, is capable of generating realistic 3D clothed humans with better perceptual quality and identity variety.
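The "carving" step can be illustrated with a toy example. The sketch below is not the Chupa implementation: it assumes a 2D heightfield in place of the SMPL-X mesh, a synthetic target normal map in place of a diffusion-generated one, and plain gradient descent on a finite-difference least-squares loss in place of the mesh optimization. The function names `normals_from_height` and `carve`, and all parameter values, are hypothetical and chosen for illustration only.

```python
import numpy as np


def normals_from_height(z):
    """Unit normals of a heightfield z[y, x], using forward differences."""
    p = np.zeros_like(z)
    q = np.zeros_like(z)
    p[:, :-1] = z[:, 1:] - z[:, :-1]  # slope along x
    q[:-1, :] = z[1:, :] - z[:-1, :]  # slope along y
    n = np.stack([-p, -q, np.ones_like(z)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)


def carve(z_prior, target_normals, steps=500, lr=0.1, prior_weight=1e-3):
    """Refine a smooth prior surface so its slopes match a target normal map.

    Minimizes 0.5*|Dx z - p_t|^2 + 0.5*|Dy z - q_t|^2
              + 0.5*prior_weight*|z - z_prior|^2 by gradient descent.
    """
    # Convert target normals to target slopes (n_z > 0 for a heightfield).
    p_t = -target_normals[..., 0] / target_normals[..., 2]
    q_t = -target_normals[..., 1] / target_normals[..., 2]
    z = z_prior.copy()
    for _ in range(steps):
        # Slope residuals on the interior forward-difference stencil.
        rx = (z[:, 1:] - z[:, :-1]) - p_t[:, :-1]
        ry = (z[1:, :] - z[:-1, :]) - q_t[:-1, :]
        # Gradient: prior term plus the adjoint of each difference operator.
        grad = prior_weight * (z - z_prior)
        grad[:, 1:] += rx
        grad[:, :-1] -= rx
        grad[1:, :] += ry
        grad[:-1, :] -= ry
        z -= lr * grad
    return z


# Synthetic setup (assumed): a smooth bump as the prior, wrinkles as detail.
ys, xs = np.mgrid[0:32, 0:32].astype(float)
coarse = 4.0 * np.exp(-((xs - 16.0) ** 2 + (ys - 16.0) ** 2) / 128.0)
detail = 0.3 * np.sin(2.0 * np.pi * xs / 4.0)
target_n = normals_from_height(coarse + detail)

carved = carve(coarse, target_n)
err_before = np.abs(normals_from_height(coarse) - target_n).mean()
err_after = np.abs(normals_from_height(carved) - target_n).mean()
print(f"mean normal error: before={err_before:.4f}, after={err_after:.4f}")
```

The small `prior_weight` term keeps the refined surface anchored to the prior, which mirrors the idea of starting from a body prior and adding only the detail that the normal maps call for.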