One-shot Implicit Animatable Avatars with Model-based Priors

Existing neural rendering methods for creating human avatars typically either require dense input signals such as video or multi-view images, or leverage a prior learned from large-scale 3D human datasets so that reconstruction can be performed from sparse-view inputs. Most of these methods fail to achieve realistic reconstruction when only a single image is available. To enable data-efficient creation of realistic animatable 3D humans, we propose ELICIT, a novel method for learning human-specific neural radiance fields from a single image. Inspired by the fact that humans can effortlessly estimate body geometry and imagine full-body clothing from a single image, we leverage two priors in ELICIT: a 3D geometry prior and a visual semantic prior. Specifically, ELICIT takes the 3D body shape geometry prior from a skinned vertex-based template model (i.e., SMPL) and implements the visual clothing semantic prior with CLIP-based pretrained models. Both priors jointly guide the optimization toward plausible content in the invisible areas. Taking advantage of the CLIP models, ELICIT can also use text descriptions to generate text-conditioned content for unseen regions. To further improve visual details, we propose a segmentation-based sampling strategy that locally refines different parts of the avatar. Comprehensive evaluations on multiple popular benchmarks, including ZJU-MoCAP, Human3.6M, and DeepFashion, show that ELICIT outperforms strong baseline methods for avatar creation when only a single image is available. The code is publicly available for research purposes at https://huangyangyi.github.io/ELICIT/.
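
To make the two refinement ideas above concrete, the following is a minimal illustrative sketch and not ELICIT's implementation, since the abstract does not specify the exact loss or sampling rule. It assumes the semantic prior can be expressed as a cosine distance between two embedding vectors (standing in for outputs of a CLIP image encoder, which is not called here), and that segmentation-based sampling amounts to cropping a padded bounding box around one labeled body part. The function names `cosine_semantic_loss` and `part_bounding_box` are hypothetical.

```python
import math


def cosine_semantic_loss(rendered_emb, reference_emb):
    """Return 1 - cosine similarity between two embedding vectors.

    Assumption: the vectors stand in for image embeddings of a rendered
    view and of the input image; no real encoder is used in this sketch.
    """
    dot = sum(a * b for a, b in zip(rendered_emb, reference_emb))
    norm_a = math.sqrt(sum(a * a for a in rendered_emb))
    norm_b = math.sqrt(sum(b * b for b in reference_emb))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def part_bounding_box(seg, part_id, pad=0):
    """Return (top, bottom, left, right) around one part of a label map.

    `seg` is a list of rows of integer part labels (hypothetical format).
    `bottom` and `right` are exclusive; the box is padded by `pad` pixels
    and clipped to the image bounds.
    """
    hits = [
        (r, c)
        for r, row in enumerate(seg)
        for c, label in enumerate(row)
        if label == part_id
    ]
    if not hits:
        raise ValueError(f"part {part_id} not present in segmentation")
    height, width = len(seg), len(seg[0])
    top = max(min(r for r, _ in hits) - pad, 0)
    bottom = min(max(r for r, _ in hits) + 1 + pad, height)
    left = max(min(c for _, c in hits) - pad, 0)
    right = min(max(c for _, c in hits) + 1 + pad, width)
    return top, bottom, left, right


if __name__ == "__main__":
    # Toy 6x6 label map: part 1 occupies rows 2-3, columns 1-3.
    seg = [[0] * 6 for _ in range(6)]
    for r in (2, 3):
        for c in (1, 2, 3):
            seg[r][c] = 1
    print(part_bounding_box(seg, part_id=1, pad=1))
    print(cosine_semantic_loss([1.0, 0.0], [0.0, 1.0]))
```

In an optimization loop of this kind, the crop returned for each body part would be rendered and compared against the reference separately, so that small regions contribute their own semantic loss instead of being averaged into a full-body comparison.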
