Personalized Face Inpainting with Diffusion Models by Parallel Visual Attention

Face inpainting is important in applications such as photo restoration, image editing, and virtual reality. Despite significant advances in face generative models, preserving a person's unique facial identity during inpainting remains an elusive goal. Current state-of-the-art techniques, exemplified by MyStyle, require resource-intensive fine-tuning and a substantial number of images for each new identity. Furthermore, existing methods often fall short in accommodating user-specified semantic attributes, such as beard or expression. To improve inpainting results and reduce the computational complexity during inference, this paper proposes Parallel Visual Attention (PVA) in conjunction with diffusion models. Specifically, we insert parallel attention matrices into each cross-attention module of the denoising network; these matrices attend to features extracted from reference images by an identity encoder. We train the added attention modules and the identity encoder on CelebAHQ-IDI, a dataset proposed for identity-preserving face inpainting. Experiments demonstrate that PVA attains the highest identity resemblance among the compared methods, including MyStyle, Paint by Example, and Custom Diffusion, on both face inpainting and language-guided face inpainting. Our findings show that PVA preserves identity well while remaining effectively controllable through language. Additionally, in contrast to Custom Diffusion, PVA requires just 40 fine-tuning steps for each new identity, which translates to a speedup of more than 20 times.
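As a rough illustration of the parallel-attention idea described above, the following is a minimal pure-Python sketch, not the paper's implementation. It assumes that the identity branch shares the query projection with the existing text cross-attention, has its own key and value projections, and that the two branch outputs are summed with a weight; the abstract does not specify these details. All names (`parallel_visual_attention`, `id_scale`, the `w_*` matrices) are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def parallel_visual_attention(hidden, text_feats, id_feats,
                              w_q, w_k_text, w_v_text, w_k_id, w_v_id,
                              id_scale=1.0):
    """Cross-attention over text features plus a parallel identity branch.

    Assumption: both branches share the query and their outputs are summed;
    this combination rule is illustrative only.
    """
    q = matmul(hidden, w_q)
    # Existing branch: attend to text-encoder features.
    text_out = attention(q, matmul(text_feats, w_k_text), matmul(text_feats, w_v_text))
    # Added branch: attend to identity-encoder features from reference images.
    id_out = attention(q, matmul(id_feats, w_k_id), matmul(id_feats, w_v_id))
    return [[t + id_scale * i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(text_out, id_out)]
```

Under these assumptions, setting `id_scale=0.0` recovers the original text-only cross-attention, so only the added identity projections (and the identity encoder) would need training while the rest of the denoising network stays unchanged.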
