One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting

Text-guided image inpainting aims to reconstruct masked regions according to text prompts. The longstanding challenges are preserving the unmasked regions while achieving semantic consistency between the unmasked and the inpainted masked regions. Previous methods have failed to address both, typically remedying only one of them. As we observe, this stems from the entanglement of hybrid (e.g., mid-and-low) frequency bands, which encode different image properties and exhibit different robustness to text prompts during the denoising process. In this paper, we propose a null-text-null frequency-aware diffusion model, dubbed \textbf{NTN-Diff}, for text-guided image inpainting. It decomposes the semantic consistency across masked and unmasked regions into a consistency for each frequency band while preserving the unmasked regions, thereby overcoming both challenges together. Based on the diffusion process, we further divide the denoising process into early (high-level noise) and late (low-level noise) stages, during which the mid- and low-frequency bands are disentangled. As observed, the stable mid-frequency band is progressively denoised into semantic alignment during the text-guided denoising process. Meanwhile, it serves as guidance for the null-text denoising process, which denoises the low-frequency band for the masked regions and is followed by a subsequent text-guided denoising process at the late stage. This achieves semantic consistency for the mid- and low-frequency bands across masked and unmasked regions while preserving the unmasked regions. Extensive experiments validate the superiority of NTN-Diff over state-of-the-art diffusion models for text-guided image inpainting. Our code is available at https://github.com/htyjers/NTN-Diff.
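To make the notion of separate frequency bands concrete, the following is a minimal sketch of one common way to split an image into low-, mid-, and high-frequency components using radial masks in the Fourier domain. It is not taken from the NTN-Diff code base: the function name `split_frequency_bands` and the cutoff values are assumptions chosen for illustration, and the abstract does not state how NTN-Diff defines its bands.

```python
import numpy as np


def split_frequency_bands(image, low_cutoff=0.1, mid_cutoff=0.4):
    """Split a 2-D array into low-, mid-, and high-frequency bands.

    The cutoffs are hypothetical values, expressed as fractions of the
    per-axis Nyquist frequency; they are not taken from the paper.
    """
    h, w = image.shape
    # Per-axis frequencies in cycles per pixel, in the range [-0.5, 0.5).
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    # Normalised radial frequency: 1.0 is the Nyquist frequency along one axis.
    radius = np.sqrt(fx ** 2 + fy ** 2) / 0.5

    spectrum = np.fft.fft2(image)
    low_mask = radius <= low_cutoff
    mid_mask = (radius > low_cutoff) & (radius <= mid_cutoff)
    high_mask = radius > mid_cutoff

    # The three masks partition the spectrum, so the bands sum to the input.
    low, mid, high = (
        np.fft.ifft2(spectrum * mask).real
        for mask in (low_mask, mid_mask, high_mask)
    )
    return low, mid, high


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((64, 64))
    low, mid, high = split_frequency_bands(img)
    print(np.allclose(low + mid + high, img))
```

Because the masks are disjoint and cover the whole spectrum, the decomposition is lossless, so each band can be inspected or processed on its own and then recombined by summation.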
