Existing Human NeRF methods for reconstructing 3D humans typically rely on multiple 2D images from multi-view cameras or monocular videos captured from fixed camera views. However, in real-world scenarios, human images are often captured from random camera angles, presenting challenges for high-quality 3D human reconstruction. In this paper, we propose SHERF, the first generalizable Human NeRF model for recovering animatable 3D humans from a single input image. SHERF extracts and encodes 3D human representations in canonical space, enabling rendering and animation from free views and poses. To achieve high-fidelity novel view and pose synthesis, the encoded 3D human representations should capture both global appearance and local fine-grained textures. To this end, we propose a bank of 3D-aware hierarchical features, including global, point-level, and pixel-aligned features, to facilitate informative encoding. Global features enhance the information extracted from the single input image and complement the information missing from the partial 2D observation. Point-level features provide strong clues of 3D human structure, while pixel-aligned features preserve more fine-grained details. To effectively integrate the 3D-aware hierarchical feature bank, we design a feature fusion transformer. Extensive experiments on THuman, RenderPeople, ZJU_MoCap, and HuMMan datasets demonstrate that SHERF achieves state-of-the-art performance, with better generalizability for novel view and pose synthesis.
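To make the hierarchical feature bank described in the abstract more concrete, the following is a minimal NumPy sketch of two of its ingredients: pixel-aligned features, obtained by projecting 3D query points into the input image and sampling a 2D feature map, and a fusion step that attends over the global, point-level, and pixel-aligned features. It is an illustration under stated assumptions, not the authors' implementation. The function names (`project_points`, `bilinear_sample`, `fuse_features`), the pinhole intrinsics `K`, the single-head attention with plain weight matrices, and the mean pooling over tokens are all hypothetical choices made for this example. The global and point-level features are random placeholders here, since the abstract does not say how they are computed.

```python
import numpy as np


def project_points(points, K):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixel coordinates."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]


def bilinear_sample(feat_map, uv):
    """Sample an (H, W, C) feature map at continuous pixel coordinates uv of shape (N, 2)."""
    H, W, _ = feat_map.shape
    # Clamp to the image so out-of-frame points reuse the border features.
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    top = feat_map[v0, u0] * (1 - du) + feat_map[v0, u1] * du
    bottom = feat_map[v1, u0] * (1 - du) + feat_map[v1, u1] * du
    return top * (1 - dv) + bottom * dv


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_features(global_f, point_f, pixel_f, Wq, Wk, Wv):
    """Single-head self-attention over three feature tokens per query point.

    global_f: (C,), point_f: (N, C), pixel_f: (N, C). Returns (N, C).
    """
    N, C = point_f.shape
    # One token per feature level: (N, 3, C). The global feature is shared by all points.
    tokens = np.stack([np.broadcast_to(global_f, (N, C)), point_f, pixel_f], axis=1)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C))  # (N, 3, 3)
    # Mean pooling over the three tokens is an assumption of this sketch.
    return (attn @ v).mean(axis=1)


# Toy usage with random placeholders for every learned quantity.
rng = np.random.default_rng(0)
H, W, C, N = 8, 8, 4, 5
feat_map = rng.normal(size=(H, W, C))          # stand-in for a 2D encoder output
K = np.array([[4.0, 0.0, 4.0],
              [0.0, 4.0, 4.0],
              [0.0, 0.0, 1.0]])                 # hypothetical camera intrinsics
points = rng.uniform(-0.5, 0.5, size=(N, 3)) + np.array([0.0, 0.0, 2.0])

pixel_f = bilinear_sample(feat_map, project_points(points, K))
point_f = rng.normal(size=(N, C))               # placeholder point-level features
global_f = rng.normal(size=(C,))                # placeholder global feature
Wq, Wk, Wv = (rng.normal(size=(C, C)) for _ in range(3))

fused = fuse_features(global_f, point_f, pixel_f, Wq, Wk, Wv)
print(fused.shape)
```

In this sketch each query point receives one fused vector of the same width as its inputs. In a NeRF-style pipeline such a vector would typically be passed to a decoder that predicts density and color, a step omitted here.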