Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting

Denoising diffusion probabilistic models for image inpainting add noise to the image texture during the forward process and recover the masked regions from the unmasked texture via the reverse denoising process. Although existing methods generate meaningful semantics, they suffer from a semantic discrepancy between masked and unmasked regions: the semantically dense unmasked texture is never completely degraded during the diffusion process, whereas the masked regions turn into pure noise. In this paper, we ask how unmasked semantics guide the texture denoising process and how the semantic discrepancy can be tackled, so as to facilitate consistent and meaningful semantics generation. To this end, we propose a novel structure-guided diffusion model, named StrDiffusion, which reformulates the conventional texture denoising process under structure guidance to derive a simplified denoising objective for image inpainting. This reformulation reveals that: 1) the semantically sparse structure helps to tackle the semantic discrepancy in the early stage, while the dense texture generates reasonable semantics in the late stage; 2) the semantics from unmasked regions essentially offer time-dependent structure guidance for the texture denoising process, benefiting from the time-dependent sparsity of the structure semantics. For the denoising process, a structure-guided neural network is trained to estimate the simplified denoising objective by exploiting the consistency of the denoised structure between masked and unmasked regions. In addition, we devise an adaptive resampling strategy that serves as a formal criterion for whether the structure is competent to guide the texture denoising process, while regulating the semantic correlation between structure and texture. Extensive experiments validate the merits of StrDiffusion over state-of-the-art methods. Our code is available at https://github.com/htyjers/StrDiffusion.
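
The semantic discrepancy described above can be illustrated with a toy sketch. This is not the StrDiffusion implementation: it assumes a standard DDPM-style linear noise schedule, a flattened 1-D stand-in for the image texture, and hypothetical helper names (`linear_alpha_bar`, `pearson`). It only shows that, at an intermediate timestep, the noised unmasked pixels still correlate with the original texture while the masked pixels, which start from pure noise, do not.

```python
import math
import random


def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
num_pixels = 4096

# Toy "texture": random values in [-1, 1], flattened to 1-D (an assumption).
texture = [rng.uniform(-1.0, 1.0) for _ in range(num_pixels)]
# True marks a masked (missing) pixel; one contiguous block is masked.
masked = [1024 <= i < 2048 for i in range(num_pixels)]

alpha_bar = linear_alpha_bar()
t = 300  # an intermediate timestep, chosen arbitrarily for illustration
signal_scale = math.sqrt(alpha_bar[t])
noise_scale = math.sqrt(1.0 - alpha_bar[t])

# Unmasked pixels follow the forward process; masked pixels are pure noise.
x_t = [
    rng.gauss(0.0, 1.0) if m
    else signal_scale * x + noise_scale * rng.gauss(0.0, 1.0)
    for x, m in zip(texture, masked)
]

unmasked_idx = [i for i, m in enumerate(masked) if not m]
masked_idx = [i for i, m in enumerate(masked) if m]

corr_unmasked = pearson([x_t[i] for i in unmasked_idx],
                        [texture[i] for i in unmasked_idx])
corr_masked = pearson([x_t[i] for i in masked_idx],
                      [texture[i] for i in masked_idx])

print(f"correlation with original, unmasked region: {corr_unmasked:.3f}")
print(f"correlation with original, masked region:   {corr_masked:.3f}")
```

The gap between the two correlations is a rough proxy for the discrepancy that the abstract attributes to the dense texture; a sparser representation such as a structure map carries less signal to preserve, which is the intuition behind using it as guidance in the early stage.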
