GAT-NeRF: Geometry-Aware-Transformer Enhanced Neural Radiance Fields for High-Fidelity 4D Facial Avatars

High-fidelity 4D dynamic facial avatar reconstruction from monocular video is a critical yet challenging task, driven by growing demand for immersive virtual human applications. Although Neural Radiance Fields (NeRF) have advanced scene representation, their capacity to capture high-frequency facial details, such as dynamic wrinkles and subtle textures, from information-constrained monocular streams requires significant enhancement. To address this challenge, we propose Geometry-Aware-Transformer Enhanced NeRF (GAT-NeRF), a hybrid neural radiance field framework that integrates a Transformer into the NeRF pipeline for high-fidelity, controllable 4D facial avatar reconstruction. GAT-NeRF combines a coordinate-aligned Multilayer Perceptron (MLP) with a lightweight Transformer module, termed the Geometry-Aware-Transformer (GAT) because it processes multi-modal inputs that contain explicit geometric priors. The GAT module fuses multi-modal input features, namely 3D spatial coordinates, 3D Morphable Model (3DMM) expression parameters, and learnable latent codes, to learn and enhance feature representations pertinent to fine-grained geometry. The Transformer's feature learning capability is leveraged to strengthen the modeling of complex local facial patterns such as dynamic wrinkles and acne scars. Comprehensive experiments demonstrate that GAT-NeRF achieves state-of-the-art visual fidelity and high-frequency detail recovery, opening new pathways for creating realistic dynamic digital humans for multimedia applications.
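
The abstract does not specify how the GAT module and the MLP are wired together, so the following is only a minimal NumPy sketch of one plausible reading: each modality (encoded 3D coordinates, 3DMM expression parameters, latent code) becomes one token, a single self-attention layer mixes the tokens, and the pooled result is concatenated with the MLP features before a density and color head. Every name, dimension (for example the 76-dimensional expression vector), the one-token-per-modality layout, the mean pooling, and the randomly initialized weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def positional_encoding(x, num_freqs=4):
    """NeRF-style sinusoidal encoding: (N, 3) -> (N, 3 + 3 * 2 * num_freqs)."""
    freqs = 2.0 ** np.arange(num_freqs)                 # (F,)
    scaled = x[:, None, :] * freqs[None, :, None]       # (N, F, 3)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=1)
    return np.concatenate([x, enc.reshape(x.shape[0], -1)], axis=1)

def init_linear(in_dim, out_dim):
    """Random weights standing in for trained parameters (assumption)."""
    return rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(in_dim, out_dim))

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over (N, T, D) tokens."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])   # (N, T, T)
    scores -= scores.max(axis=-1, keepdims=True)               # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def make_params(pe_dim, expr_dim, latent_dim, d_model=32, hidden=64):
    """Hypothetical parameter set; all sizes are illustrative."""
    return {
        "pos": init_linear(pe_dim, d_model),
        "expr": init_linear(expr_dim, d_model),
        "lat": init_linear(latent_dim, d_model),
        "q": init_linear(d_model, d_model),
        "k": init_linear(d_model, d_model),
        "v": init_linear(d_model, d_model),
        "mlp": init_linear(pe_dim + expr_dim + latent_dim, hidden),
        "head": init_linear(hidden + d_model, 4),
    }

def gat_nerf_forward(points, expression, latent, p):
    """Return (density, rgb) for N sample points under the assumed design."""
    n = points.shape[0]
    pe = positional_encoding(points)
    expr = np.broadcast_to(expression, (n, expression.shape[0]))
    lat = np.broadcast_to(latent, (n, latent.shape[0]))

    # Transformer branch: one token per modality, then residual self-attention.
    tokens = np.stack([pe @ p["pos"], expr @ p["expr"], lat @ p["lat"]], axis=1)
    attended = tokens + self_attention(tokens, p["q"], p["k"], p["v"])
    gat_feature = attended.mean(axis=1)                        # (N, d_model)

    # MLP branch on the concatenated raw inputs, fused by concatenation.
    hidden = np.maximum(np.concatenate([pe, expr, lat], axis=1) @ p["mlp"], 0.0)
    out = np.concatenate([hidden, gat_feature], axis=1) @ p["head"]

    density = np.maximum(out[:, 0], 0.0)                       # non-negative density
    rgb = 1.0 / (1.0 + np.exp(-out[:, 1:]))                    # colors in [0, 1]
    return density, rgb

points = rng.uniform(-1.0, 1.0, size=(5, 3))    # 5 sample points along rays
expression = rng.normal(size=76)                # assumed 3DMM expression size
latent = rng.normal(size=32)                    # assumed per-frame latent code
params = make_params(pe_dim=27, expr_dim=76, latent_dim=32)
density, rgb = gat_nerf_forward(points, expression, latent, params)
print(density.shape, rgb.shape)
```

In a full pipeline the returned densities and colors would feed standard NeRF volume rendering along each camera ray; that step is omitted here because the abstract does not describe it.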
