Encoder-decoder architectures are widely used in the generators of generative adversarial networks for facial manipulation. However, we observe that current architectures fail to recover the color of the input image and rich facial details such as skin color and texture, and that they also introduce artifacts. In this paper, we present a novel method named SARGAN that addresses these limitations from three perspectives. First, we employ spatial attention-based residual blocks instead of vanilla residual blocks to properly capture the expression-related features to be changed while keeping the other features unchanged. Second, we exploit a symmetric encoder-decoder network to attend to facial features at multiple scales. Third, we propose to train the complete network with a residual connection that feeds the input image directly toward the end of the generator; this relieves the generator of the pressure of regenerating the input face, so that it can focus on producing the desired expression. Both qualitative and quantitative experimental results show that our proposed model performs significantly better than state-of-the-art methods. In addition, existing models require much larger datasets for training, yet their performance degrades on out-of-distribution images. In contrast, SARGAN can be trained on smaller facial expression datasets and generalizes well to out-of-distribution images, including human photographs, portraits, avatars and statues.
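The two residual ideas in the abstract can be illustrated with a toy sketch. This is not the SARGAN implementation: the abstract gives no layer details, so the function names `gated_residual` and `global_residual`, the single-channel grids, the hand-written attention mask, and the clamping to [0, 1] are all assumptions made for illustration. In a real model the mask and the update would presumably come from learned layers.

```python
# Toy sketch only: plain Python lists stand in for H x W feature maps.
# All names and values here are hypothetical, not taken from the paper.

def gated_residual(features, update, mask):
    """Attention-gated residual: features + mask * update, elementwise.

    Where mask is 0 the features pass through unchanged; where it is 1
    the full update is applied.
    """
    return [
        [f + m * u for f, u, m in zip(f_row, u_row, m_row)]
        for f_row, u_row, m_row in zip(features, update, mask)
    ]


def global_residual(image, delta, lo=0.0, hi=1.0):
    """Global residual: add the predicted change to the input image.

    The network only has to predict `delta`, not regenerate the whole
    image. Clamping to [lo, hi] is an assumption of this sketch.
    """
    return [
        [min(hi, max(lo, p + d)) for p, d in zip(p_row, d_row)]
        for p_row, d_row in zip(image, delta)
    ]


# 2 x 2 single-channel example.
features = [[0.2, 0.4], [0.6, 0.8]]
update = [[0.5, 0.5], [0.5, 0.5]]
mask = [[0.0, 1.0], [0.0, 0.0]]  # attend only to the top-right location
gated = gated_residual(features, update, mask)

image = [[0.2, 0.4], [0.6, 0.8]]
delta = [[0.0, 0.3], [0.0, 0.0]]  # change predicted for one pixel only
edited = global_residual(image, delta)
```

In this toy setting, every location with a zero mask or zero delta keeps its input value exactly, which is the behavior the abstract describes as changing expression-related features while keeping the others unchanged.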