Although raw driving videos contain richer information on facial expressions than intermediate representations such as landmarks, they are seldom the subject of research in the field of portrait animation. This is due to two challenges inherent in portrait animation driven by raw videos: 1) significant identity leakage; 2) irrelevant background and facial details, such as wrinkles, that degrade performance. To harness the power of raw videos for vivid portrait animation, we propose a pioneering conditional diffusion model named MegActor. First, we introduce a synthetic data generation framework that creates videos with consistent motion and expressions but inconsistent identities, mitigating the issue of identity leakage. Second, we segment the foreground and background of the reference image and employ CLIP to encode the background details; this encoded information is then injected into the network via a text embedding module, ensuring the stability of the background. Finally, we transfer the appearance style of the reference image to the driving video to eliminate the influence of facial details in the driving video. Our final model is trained solely on public datasets and achieves results comparable to commercial models. We hope this will benefit the open-source community. The code is available at https://github.com/megvii-research/MegFaceAnimate.
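The background-injection step described above can be sketched as a cross-attention layer inside the denoising UNet: CLIP tokens for the segmented background serve as the key/value context, in place of (or alongside) text tokens. The module below is a minimal, hypothetical illustration with made-up dimensions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class BackgroundCrossAttention(nn.Module):
    """Hypothetical sketch: inject CLIP-encoded background tokens into a
    diffusion UNet block through its cross-attention (text-embedding) path."""

    def __init__(self, latent_dim: int = 320, clip_dim: int = 768, heads: int = 8):
        super().__init__()
        # Project CLIP image-encoder tokens to the UNet's channel width.
        self.to_ctx = nn.Linear(clip_dim, latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)

    def forward(self, unet_tokens: torch.Tensor, clip_bg_tokens: torch.Tensor) -> torch.Tensor:
        # unet_tokens:    (B, H*W, latent_dim) flattened spatial features
        # clip_bg_tokens: (B, T, clip_dim) tokens for the segmented background
        ctx = self.to_ctx(clip_bg_tokens)
        out, _ = self.attn(unet_tokens, ctx, ctx)  # queries attend to background
        return unet_tokens + out  # residual injection keeps the base features


# Example with dummy shapes (257 tokens ~ a ViT-L/14 CLIP patch grid + [CLS]):
layer = BackgroundCrossAttention()
x = torch.randn(2, 4096, 320)   # 64x64 latent grid
bg = torch.randn(2, 257, 768)   # CLIP background tokens
y = layer(x, bg)
```

Routing the background through the conditioning path, rather than concatenating it to the latent, lets every spatial location attend to the full background, which is one plausible reading of why the paper reports stable backgrounds.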