We study the problem of creating high-fidelity, animatable 3D avatars from textual descriptions alone. Existing text-to-avatar methods are either limited to static avatars that cannot be animated, or struggle to generate animatable avatars with high quality and precise pose control. To address these limitations, we propose AvatarStudio, a coarse-to-fine generative model that produces explicit textured 3D meshes for animatable human avatars. Specifically, AvatarStudio begins with a low-resolution NeRF-based representation for coarse generation, then incorporates SMPL-guided articulation into an explicit mesh representation to support avatar animation and high-resolution rendering. To ensure view consistency and pose controllability of the resulting avatars, we introduce a 2D diffusion model conditioned on DensePose for Score Distillation Sampling (SDS) supervision. By leveraging the synergy between the articulated mesh representation and the DensePose-conditioned diffusion model, AvatarStudio creates high-quality, animation-ready avatars from text and significantly outperforms previous methods. Moreover, it supports a range of applications, such as multimodal avatar animation and style-guided avatar creation. For more results, please refer to our project page: http://jeff95.me/projects/avatarstudio.html
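For readers unfamiliar with Score Distillation Sampling, the supervision signal mentioned above can be sketched in its commonly used form. The abstract does not state the exact objective, so the notation here is an assumption: $\theta$ denotes the avatar parameters, $x = g(\theta)$ a rendered image, $y$ the text prompt, $c$ a DensePose map rendered for the same camera and body pose, $\hat{\epsilon}_\phi$ the noise predicted by the conditioned 2D diffusion model, $\bar{\alpha}_t$ the noise schedule, and $w(t)$ a timestep weighting.

```latex
% Standard SDS gradient, written with an extra DensePose condition c (assumed notation).
\[
  x_t = \sqrt{\bar{\alpha}_t}\, x + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
  \qquad x = g(\theta), \quad \epsilon \sim \mathcal{N}(0, I)
\]
\[
  \nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, c,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right]
\]
```

Under this reading, conditioning on $c$ is what ties the diffusion prior to the rendered viewpoint and pose, which is consistent with the view-consistency and pose-controllability goals stated in the abstract.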