Template-free animatable head avatars can achieve high visual fidelity by learning expression-dependent facial deformation directly from a subject's capture, avoiding parametric face templates and hand-designed blendshape spaces. However, because the learned deformation is supervised only by the expressions observed for a single identity, these models suffer from limited expression coverage and often struggle when driven by motions that deviate from the training distribution. We introduce RAF (Retrieval-Augmented Faces), a simple training-time augmentation designed for template-free head avatars that learn deformation from data. RAF constructs a large unlabeled expression bank and, during training, replaces a subset of the subject's expression features with nearest-neighbor expressions retrieved from this bank, while still reconstructing the subject's original frames. This exposes the deformation field to a broader range of expression conditions, encouraging stronger identity-expression decoupling and improving robustness to expression distribution shift, without requiring paired cross-identity data, additional annotations, or architectural changes. We further analyze how retrieval augmentation increases expression diversity, and we validate retrieval quality with a user study showing that retrieved neighbors are perceptually close to the query in expression and pose. Experiments on the NeRSemble benchmark demonstrate that RAF consistently improves expression fidelity over the baseline in both self-driving and cross-driving scenarios.
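To make the swap mechanism concrete, the following is a minimal PyTorch sketch of the retrieval-augmented training step described above. All names (`build_expression_bank`, `retrieve_nearest`, `augment_expressions`, `swap_ratio`) and the cosine-similarity retrieval metric are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def build_expression_bank(feature_sets):
    """Stack expression features extracted from many unlabeled captures
    into a single bank of shape (N, D)."""
    return torch.cat(feature_sets, dim=0)

def retrieve_nearest(queries, bank):
    """Top-1 nearest-neighbor lookup; cosine similarity is an assumed
    metric, since the abstract does not specify one."""
    q = F.normalize(queries, dim=-1)   # (B, D)
    b = F.normalize(bank, dim=-1)      # (N, D)
    idx = (q @ b.t()).argmax(dim=-1)   # (B,) index of nearest neighbor
    return bank[idx]                   # (B, D)

def augment_expressions(expr_feats, bank, swap_ratio=0.5):
    """Replace a random subset of the batch's expression features with
    their retrieved neighbors; the reconstruction targets (the subject's
    original frames) stay unchanged."""
    swap = torch.rand(expr_feats.shape[0], device=expr_feats.device) < swap_ratio
    out = expr_feats.clone()
    if swap.any():
        out[swap] = retrieve_nearest(expr_feats[swap], bank)
    return out
```

In a training step under this sketch, the avatar would be conditioned on the possibly-swapped expression features while the photometric loss still compares renders against the subject's original frames; that mismatch between condition and target is what pushes the deformation field to generalize across expression conditions rather than memorize the subject's own expressions.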