Facial image inpainting aims to restore missing or corrupted regions of face images while preserving identity, structural consistency, and photorealistic quality, a task central to photo restoration. Despite recent advances in deep generative models, existing methods struggle with large irregular masks, often producing blurry textures along mask boundaries, semantic inconsistencies, or implausible facial structures, owing to direct pixel-level synthesis and limited exploitation of facial priors. In this paper we propose a novel architecture that addresses these challenges through semantic-guided hierarchical synthesis. Our approach first organizes and synthesizes information at the semantic level and then refines texture, establishing a clear understanding of facial structure before detailed image generation. In the first stage, we blend two techniques: CNNs that capture local features and Vision Transformers that capture global context, yielding sharp, detailed semantic layouts. In the second stage, a Multi-Modal Texture Generator refines these layouts by aggregating information across multiple scales, ensuring a cohesive and consistent result. The architecture naturally handles arbitrary mask configurations through dynamic attention, without mask-specific training. Experiments on the CelebA-HQ and FFHQ datasets show that our model outperforms state-of-the-art methods on metrics including LPIPS, PSNR, and SSIM, and produces visually striking results with better semantic preservation in challenging large-area inpainting scenarios.
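To make the two-stage pipeline concrete, the following is a minimal PyTorch sketch of one plausible realization. The module names (SemanticLayoutNet, MultiModalTextureGenerator), channel widths, the 19-class layout head, and the dilated-convolution multi-scale fusion are illustrative assumptions, not the exact architecture described above.

```python
# Minimal sketch of the two-stage design. All module names, channel sizes,
# and the fusion scheme are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class SemanticLayoutNet(nn.Module):
    """Stage 1: fuse CNN local features with ViT-style global attention
    to predict a coarse semantic layout of the masked face."""

    def __init__(self, in_ch=4, dim=64, num_heads=4):  # 4 = RGB + mask
        super().__init__()
        # CNN branch: local texture and edge features
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.GELU(),
        )
        # Transformer layer: global context over the downsampled feature map
        self.attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=dim * 4,
            batch_first=True)
        self.to_layout = nn.Conv2d(dim, 19, 1)  # e.g. 19 face-parsing classes

    def forward(self, x):
        f = self.cnn(x)                          # (B, C, H/4, W/4)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)    # (B, HW, C)
        tokens = self.attn(tokens)               # global self-attention
        f = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.to_layout(f)                 # coarse semantic layout logits


class MultiModalTextureGenerator(nn.Module):
    """Stage 2: refine the image conditioned on the predicted layout,
    aggregating features at multiple scales."""

    def __init__(self, img_ch=3, layout_ch=19, dim=64):
        super().__init__()
        self.enc = nn.Conv2d(img_ch + 1 + layout_ch, dim, 3, padding=1)
        # Multi-scale aggregation via parallel dilated convolutions
        self.scales = nn.ModuleList(
            nn.Conv2d(dim, dim, 3, padding=d, dilation=d) for d in (1, 2, 4))
        self.dec = nn.Conv2d(dim * 3, img_ch, 3, padding=1)

    def forward(self, img, mask, layout):
        layout = nn.functional.interpolate(layout, size=img.shape[-2:])
        f = torch.relu(self.enc(torch.cat([img, mask, layout], dim=1)))
        f = torch.cat([torch.relu(s(f)) for s in self.scales], dim=1)
        out = torch.tanh(self.dec(f))
        # Composite: keep known pixels, synthesize only the masked region
        return img * (1 - mask) + out * mask


if __name__ == "__main__":
    img = torch.randn(1, 3, 256, 256)
    mask = (torch.rand(1, 1, 256, 256) > 0.5).float()  # arbitrary mask
    masked = img * (1 - mask)
    layout = SemanticLayoutNet()(torch.cat([masked, mask], dim=1))
    result = MultiModalTextureGenerator()(masked, mask, layout)
    print(result.shape)  # torch.Size([1, 3, 256, 256])
```

Because the mask enters only as an input channel and through the final composite, the sketch places no constraint on mask shape, which mirrors the claim that arbitrary mask configurations are handled without mask-specific training.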