While recent research has progressively overcome the low-resolution constraint of one-shot face video re-enactment with the help of StyleGAN's high-fidelity portrait generation, these approaches rely on at least one of the following: explicit 2D/3D priors, optical-flow-based warping as motion descriptors, off-the-shelf encoders, etc., which constrain their performance (e.g., inconsistent predictions, inability to capture fine facial details and accessories, poor generalization, and artifacts). We propose an end-to-end framework that simultaneously supports face attribute edits, facial motions and deformations, and facial identity control for video generation. It employs a hybrid latent space that encodes a given frame into a pair of latents: an identity latent, $\mathcal{W}_{ID}$, and a facial deformation latent, $\mathcal{S}_F$, which reside in the $W+$ and $SS$ spaces of StyleGAN2, respectively. The framework thereby combines the impressive editability-distortion trade-off of $W+$ with the high disentanglement of $SS$. The StyleGAN2 generator takes these hybrid latents as input to achieve high-fidelity face video re-enactment at $1024^2$. Furthermore, the model supports the generation of realistic re-enactment videos with other latent-based semantic edits (e.g., beard, age, and make-up). Qualitative and quantitative analyses against state-of-the-art methods demonstrate the superiority of the proposed approach.
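The identity/deformation split described in the abstract can be illustrated with a minimal sketch. It assumes that re-enactment pairs the identity latent of a single source frame with the facial deformation latent of each driving frame, and that a semantic edit is applied to the identity latent. The abstract does not state how the two latents are combined inside the generator, so that step is left to a caller-supplied function. The names `HybridLatent`, `reenact`, `toy_encode`, and `toy_generate` are hypothetical, and the toy encoder and generator are stand-ins for the real networks, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class HybridLatent:
    """Pair of latents for one frame (names are illustrative)."""
    w_id: Vector  # identity latent, W_ID (W+ space in the abstract)
    s_f: Vector   # facial deformation latent, S_F (SS space in the abstract)


def reenact(
    encode: Callable[[Sequence[float]], HybridLatent],
    generate: Callable[[HybridLatent], List[float]],
    source: Sequence[float],
    driving_frames: Sequence[Sequence[float]],
    edit: Optional[Callable[[Vector], Vector]] = None,
) -> List[List[float]]:
    """Assumed pipeline: source identity + per-frame driving deformation."""
    w_id = encode(source).w_id
    if edit is not None:
        # Assumption: latent-based semantic edits act on the identity latent.
        w_id = edit(w_id)
    video = []
    for frame in driving_frames:
        s_f = encode(frame).s_f  # motion comes from the driving frame only
        video.append(generate(HybridLatent(w_id=w_id, s_f=s_f)))
    return video


# Toy stand-ins so the sketch runs end to end: a "frame" is a flat vector
# whose first half plays the role of identity and second half of deformation.
def toy_encode(frame: Sequence[float]) -> HybridLatent:
    half = len(frame) // 2
    return HybridLatent(w_id=tuple(frame[:half]), s_f=tuple(frame[half:]))


def toy_generate(latent: HybridLatent) -> List[float]:
    return list(latent.w_id + latent.s_f)


source = [1.0, 1.0, 0.0, 0.0]
driving = [[9.0, 9.0, 0.1, 0.2], [9.0, 9.0, 0.3, 0.4]]
video = reenact(toy_encode, toy_generate, source, driving)
```

In this toy setting every output frame keeps the source's identity half and takes its deformation half from the corresponding driving frame. A real system would replace the two stand-in functions with a trained encoder and the StyleGAN2 generator.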