Face inpainting requires a model to have a precise global understanding of the positional structure of the face. Benefiting from powerful deep learning backbones, recent face inpainting methods achieve decent performance in the ideal setting (square $512$ px images). However, when applied directly to arbitrary-shaped images in real-world scenarios, existing methods often produce visually unpleasant results, especially in position-sensitive details (e.g., eyes and nose). These artifacts indicate that existing methods are limited in their ability to process position information. In this paper, we propose an \textbf{I}mplicit \textbf{N}eural \textbf{I}npainting \textbf{N}etwork (IN$^2$) that handles arbitrary-shaped face images in real-world scenarios by explicitly modeling position information. Specifically, a downsample processing encoder is proposed to reduce information loss while obtaining the global semantic feature. A neighbor hybrid attention block with a hybrid attention mechanism is proposed to improve the model's understanding of the face without restricting the shape of the input. Finally, an implicit neural pyramid decoder is introduced to explicitly model position information and to bridge the gap between low-resolution features and high-resolution output. Extensive experiments demonstrate the superiority of the proposed method on the real-world face inpainting task.
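To make the idea of decoding from explicit position information more concrete, the following is a minimal sketch of coordinate-based implicit decoding in general, not the paper's implicit neural pyramid decoder. It assumes that each output pixel is produced by querying the nearest cell of a low-resolution feature map and passing that feature together with the relative coordinate offset to a decoding function. The names `pixel_centers`, `implicit_decode`, and `toy_decoder` are hypothetical, and `toy_decoder` is a fixed linear stand-in for what would normally be a learned MLP.

```python
from typing import Callable, List, Sequence, Tuple

Feature = Sequence[float]
Offset = Tuple[float, float]


def pixel_centers(n: int) -> List[float]:
    """Centers of n equal-width cells spanning the interval [-1, 1]."""
    return [-1.0 + (2 * i + 1) / n for i in range(n)]


def implicit_decode(
    feat: Sequence[Sequence[Feature]],
    out_h: int,
    out_w: int,
    decode_fn: Callable[[Feature, Offset], float],
) -> List[List[float]]:
    """Decode an out_h x out_w grid from a low-resolution feature map.

    The output size is independent of the feature-map size, so the same
    function can produce non-square outputs of any resolution.
    """
    in_h, in_w = len(feat), len(feat[0])
    ys_in, xs_in = pixel_centers(in_h), pixel_centers(in_w)
    out = []
    for y in pixel_centers(out_h):
        # Index of the low-resolution row whose cell contains y.
        i = min(int((y + 1.0) / 2.0 * in_h), in_h - 1)
        row = []
        for x in pixel_centers(out_w):
            # Index of the low-resolution column whose cell contains x.
            j = min(int((x + 1.0) / 2.0 * in_w), in_w - 1)
            # Offset of the query point from the center of that cell.
            rel = (y - ys_in[i], x - xs_in[j])
            row.append(decode_fn(feat[i][j], rel))
        out.append(row)
    return out


def toy_decoder(z: Feature, rel: Offset) -> float:
    """Hypothetical stand-in for a learned MLP: linear in the offset."""
    return z[0] + rel[0] * z[1] + rel[1] * z[2]


# A 2 x 2 feature map decoded to a non-square 3 x 5 output.
feat = [
    [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    [[3.0, 0.0, 0.0], [4.0, 0.0, 0.0]],
]
out = implicit_decode(feat, 3, 5, toy_decoder)
```

Because the decoder is queried per coordinate, the output height and width are free parameters rather than properties of the network, which is one way a model can avoid being tied to a fixed square resolution.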