Generating a 3D human model from a single reference image is challenging because it requires inferring texture and geometry in unseen views while maintaining consistency with the reference image. Previous methods that use 3D generative models are limited by the availability of 3D training data. Optimization-based methods that lift text-to-image diffusion models to 3D generation often fail to preserve the texture details of the reference image, resulting in appearances that are inconsistent across views. In this paper, we propose HumanRef, a framework for generating 3D humans from a single-view input. To ensure the generated 3D model is photorealistic and consistent with the input image, HumanRef introduces a novel method called reference-guided score distillation sampling (Ref-SDS), which effectively incorporates image guidance into the generation process. Furthermore, we introduce region-aware attention to Ref-SDS, ensuring accurate correspondence between different body regions. Experimental results demonstrate that HumanRef outperforms state-of-the-art methods in generating 3D clothed humans with fine geometry, photorealistic textures, and view-consistent appearances.