In this work, we introduce a new approach for artistic face stylization. Although existing methods achieve impressive results on this task, there is still room for improvement in generating high-quality stylized faces with diverse styles and accurate facial reconstruction. Our proposed framework, MMFS, supports multi-modal face stylization by leveraging the strengths of StyleGAN and integrating it into an encoder-decoder architecture. Specifically, we use the mid-resolution and high-resolution layers of StyleGAN as the decoder to generate high-quality faces, while aligning its low-resolution layer with the encoder to extract and preserve input facial details. We also introduce a two-stage training strategy: in the first stage, we train the encoder to align its feature maps with StyleGAN and enable faithful reconstruction of input faces; in the second stage, we fine-tune the entire network on artistic data for stylized face generation. To apply the fine-tuned model to zero-shot and one-shot stylization tasks, we train an additional mapping network from the large-scale Contrastive Language-Image Pre-training (CLIP) space to the latent $w+$ space of the fine-tuned StyleGAN. Qualitative and quantitative experiments show that our framework achieves superior face stylization performance in both one-shot and zero-shot stylization tasks, outperforming state-of-the-art methods by a large margin.
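The training schedule described above can be sketched as a table of which components are updated at each step. This is only an illustrative sketch: the module names (`encoder`, `stylegan_mid_high_layers`, `clip_to_wplus_mapper`) and stage labels are hypothetical, and we assume that components not mentioned for a given step are kept frozen, which the abstract does not state explicitly.

```python
from typing import Dict

# Hypothetical component names; the abstract does not name them.
ALL_MODULES = ("encoder", "stylegan_mid_high_layers", "clip_to_wplus_mapper")

# Which components receive gradient updates at each step.
# Assumption: anything not listed for a step is frozen.
STAGES = {
    # Stage 1: train the encoder to align with StyleGAN and reconstruct inputs.
    "stage1_reconstruction": {"encoder"},
    # Stage 2: fine-tune the entire encoder-decoder network on artistic data.
    "stage2_stylization": {"encoder", "stylegan_mid_high_layers"},
    # Additional step: train the CLIP-to-w+ mapping network only (assumed).
    "clip_mapping": {"clip_to_wplus_mapper"},
}


def trainable_flags(stage: str) -> Dict[str, bool]:
    """Return a {module_name: is_trainable} map for the given stage."""
    if stage not in STAGES:
        raise KeyError(f"unknown stage: {stage!r}")
    active = STAGES[stage]
    return {name: name in active for name in ALL_MODULES}


if __name__ == "__main__":
    for stage in STAGES:
        print(stage, trainable_flags(stage))
```

In a PyTorch implementation, flags like these would typically be applied by calling `requires_grad_` on each module before building the optimizer for that stage.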