In this paper, we present StyleLipSync, a style-based personalized lip-sync video generative model that can generate identity-agnostic lip-synchronized video from arbitrary audio. To generate video of arbitrary identities, we leverage an expressive lip prior from the semantically rich latent space of a pre-trained StyleGAN, where video consistency can also be designed through a linear transformation. In contrast to previous lip-sync methods, we introduce pose-aware masking, which uses a 3D parametric mesh predictor to position the mask dynamically frame by frame and thereby improves naturalness across frames. Moreover, we propose a few-shot lip-sync adaptation method for an arbitrary person, introducing a sync regularizer that preserves lip-sync generalization while enhancing person-specific visual information. Extensive experiments demonstrate that our model generates accurate lip-sync videos even in the zero-shot setting, and that the proposed adaptation method enhances the characteristics of an unseen face using only a few seconds of target video. Please refer to our project page.