Despite significant progress in talking head synthesis since the introduction of Neural Radiance Fields (NeRF), visual artifacts and high training costs persist as major obstacles to large-scale commercial adoption. We propose that identifying and establishing fine-grained and generalizable correspondences between driving signals and generated results can simultaneously resolve both problems. Here we present LokiTalk, a novel framework designed to enhance NeRF-based talking heads with lifelike facial dynamics and improved training efficiency. To achieve fine-grained correspondences, we introduce Region-Specific Deformation Fields, which decompose the overall portrait motion into lip movements, eye blinking, head pose, and torso movements. By hierarchically modeling the driving signals and their associated regions through two cascaded deformation fields, we significantly improve dynamic accuracy and minimize synthetic artifacts. Furthermore, we propose ID-Aware Knowledge Transfer, a plug-and-play module that learns generalizable dynamic and static correspondences from multi-identity videos, while simultaneously extracting ID-specific dynamic and static features to refine the depiction of individual characters. Comprehensive evaluations demonstrate that LokiTalk delivers superior high-fidelity results and training efficiency compared to previous methods. The code will be released upon acceptance.