Talking face generation creates talking videos from arbitrary appearance and motion signals. This "arbitrary" input setting makes the technology easy to use but also introduces challenges in practical applications. Existing methods work well with standard inputs but degrade severely on intricate real-world ones. Efficiency is also an important concern in deployment. To address these issues comprehensively, we introduce SuperFace, a teacher-student framework that balances quality, robustness, cost, and editability. We first propose a simple but effective teacher model that handles inputs of varying quality and generates high-quality results. Building on this, we devise an efficient distillation strategy to obtain an identity-specific student model that maintains quality at a significantly reduced computational load. Our experiments validate that SuperFace offers a more comprehensive solution than existing methods across these four objectives; in particular, the student model reduces FLOPs by 99\%. SuperFace can be driven by both video and audio and allows localized editing of facial attributes.