State-of-the-art face swap models still suffer from one of two problems: either the target identity (i.e., shape) leaks into the final result, or the target's non-identity attributes (e.g., background, hair) are not fully preserved. We show that this insufficient disentanglement is caused by two flawed designs commonly adopted in prior models: (1) relying on a single compressed encoder to represent both the semantic-level non-identity facial attributes (e.g., pose) and the pixel-level non-facial region details, two goals that are contradictory to satisfy at the same time; (2) relying heavily on long skip connections between the encoder and the final generator, which leak a certain amount of target face identity into the result. To fix them, we introduce a new face swap framework called 'WSC-swap' that removes skip connections and uses two target encoders to capture, respectively, the pixel-level non-facial region attributes and the semantic non-identity attributes in the face region. To further reinforce disentanglement learning for the target encoder, we employ both an identity removal loss via adversarial training (i.e., GAN) and a non-identity preservation loss via prior 3DMM models such as [11]. Extensive experiments on both FaceForensics++ and CelebA-HQ show that our results significantly outperform previous works on a rich set of metrics, including one novel metric for measuring identity consistency that was completely neglected before.
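The data flow described above (two target encoders, no skip connections into the generator) can be illustrated with a minimal sketch. This is not the WSC-swap implementation: the function names (`keep_region`, `swap_without_skips`), the face mask input, and the toy stand-ins for the learned networks are all assumptions made only to show how the generator can be restricted to three inputs, so that raw target-face pixels never reach it directly.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]  # toy grayscale image as rows of pixel values
Mask = List[List[int]]     # 1 marks the face region, 0 everything else


def keep_region(image: Image, mask: Mask, region: int) -> Image:
    """Return a copy of `image` with pixels outside `region` set to 0.0."""
    return [
        [pixel if m == region else 0.0 for pixel, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def swap_without_skips(
    source: Image,
    target: Image,
    face_mask: Mask,
    identity_encoder: Callable[[Image], Sequence[float]],
    face_attribute_encoder: Callable[[Image], Sequence[float]],
    non_face_encoder: Callable[[Image], Image],
    generator: Callable[[Sequence[float], Sequence[float], Image], Image],
) -> Image:
    """Hypothetical wiring: the generator never sees raw target-face pixels."""
    identity_code = identity_encoder(source)
    # Semantic branch: compressed non-identity attributes of the face region.
    attribute_code = face_attribute_encoder(keep_region(target, face_mask, 1))
    # Pixel-level branch: details of everything outside the face region.
    non_face_features = non_face_encoder(keep_region(target, face_mask, 0))
    # No skip connections: these three inputs are all the generator receives.
    return generator(identity_code, attribute_code, non_face_features)


# Toy stand-ins (not learned networks) that only make the wiring executable.
def mean_code(image: Image) -> List[float]:
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels)]


def toy_generator(identity_code, attribute_code, non_face_features):
    # Paint the identity code wherever the non-face branch carries a zero;
    # this toy ignores attribute_code, which a real generator would use.
    return [
        [p if p != 0.0 else identity_code[0] for p in row]
        for row in non_face_features
    ]


source = [[0.5, 0.5], [0.5, 0.5]]
target = [[0.25, 0.75], [0.125, 0.375]]
face_mask = [[0, 1], [0, 0]]  # only the top-right pixel is "face"

result = swap_without_skips(
    source, target, face_mask,
    identity_encoder=mean_code,
    face_attribute_encoder=mean_code,
    non_face_encoder=lambda image: image,
    generator=toy_generator,
)
print(result)
```

In this toy run the non-face pixels of the target pass through unchanged, while the single face pixel is filled from the source-derived code. The adversarial identity removal loss and the 3DMM-based preservation loss mentioned above are training objectives and are not modeled here.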