We present RodinHD, which can generate high-fidelity 3D avatars from a portrait image. Existing methods fail to capture intricate details such as hairstyles, a problem we tackle in this paper. We first identify an overlooked problem of catastrophic forgetting that arises when fitting triplanes sequentially on many avatars, caused by the MLP decoder sharing scheme. To overcome this issue, we propose a novel data scheduling strategy and a weight consolidation regularization term, which improve the decoder's capability of rendering sharper details. Additionally, we optimize the guiding effect of the portrait image by computing a finer-grained hierarchical representation that captures rich 2D texture cues, and injecting it into the 3D diffusion model at multiple layers via cross-attention. When trained on 46K avatars with a noise schedule optimized for triplanes, the resulting model can generate 3D avatars with notably better details than previous methods and can generalize to in-the-wild portrait input.
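To illustrate the idea behind weight consolidation, the sketch below shows an EWC-style quadratic penalty that keeps shared decoder weights close to values learned on earlier avatars, weighted by a per-parameter importance score. This is a minimal, hypothetical formulation for intuition only; the function name, scalar-parameter representation, and importance weighting are assumptions, not the paper's exact regularizer.

```python
def consolidation_loss(params, anchor, importance, lam=1.0):
    """EWC-style weight consolidation penalty (hypothetical sketch).

    params:     dict of current shared-decoder parameters (scalars here
                for simplicity; in practice these would be tensors)
    anchor:     dict of parameter values learned on earlier avatars
    importance: dict of per-parameter importance weights
    lam:        regularization strength
    """
    # Quadratic pull toward the anchored weights, scaled by importance,
    # so fitting new avatars does not erase detail learned on old ones.
    return lam * sum(
        importance[k] * (params[k] - anchor[k]) ** 2 for k in params
    )
```

During sequential triplane fitting, a term like this would be added to the rendering loss so that parameters deemed important for previously fitted avatars resist large updates.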