Diffusion models achieve state-of-the-art performance in generating new samples but lack a low-dimensional latent space that encodes the data into editable features. Inversion-based methods address this by reversing the denoising trajectory, mapping images back to their approximated starting noise. In this work, we thoroughly analyze this procedure and focus on the relation between the initial noise, the generated samples, and their corresponding latent encodings obtained through DDIM inversion. First, we show that the latents exhibit structural patterns in the form of less diverse noise predicted for smooth image areas (e.g., plain sky). Through a series of analyses, we trace this issue to the first inversion steps, which fail to provide accurate and diverse noise. Consequently, the DDIM inversion space is notably less amenable to manipulation than the original noise. We show that prior inversion methods do not fully resolve this issue, but our simple fix, in which we replace the first DDIM inversion steps with a forward diffusion process, successfully decorrelates the latent encodings and enables higher-quality edits and interpolations. The code is available at https://github.com/luk-st/taba.
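The proposed fix can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: `eps_model` stands in for a trained noise predictor, and the function names, the toy schedule, and the parameter `k` (the number of early inversion steps replaced by the stochastic forward process) are illustrative assumptions.

```python
import numpy as np

def ddim_inversion_step(x_t, eps, ab_t, ab_next):
    # One deterministic DDIM inversion step: map x_t to x_{t+1}
    # using the predicted noise eps and cumulative alphas ab_t, ab_next.
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
    return np.sqrt(ab_next) * x0_pred + np.sqrt(1.0 - ab_next) * eps

def forward_diffusion(x0, ab_t, rng):
    # Stochastic forward process: jump directly to noise level ab_t
    # by adding fresh Gaussian noise (this is what replaces the
    # first inversion steps in the fix described above).
    return np.sqrt(ab_t) * x0 + np.sqrt(1.0 - ab_t) * rng.standard_normal(x0.shape)

def invert(x0, alpha_bars, eps_model, k, rng):
    # Hypothetical inversion loop: the first k steps use the forward
    # process, the rest use deterministic DDIM inversion.
    x = forward_diffusion(x0, alpha_bars[k], rng) if k > 0 else x0.copy()
    for t in range(k, len(alpha_bars) - 1):
        x = ddim_inversion_step(x, eps_model(x, t), alpha_bars[t], alpha_bars[t + 1])
    return x
```

With `k = 0` this reduces to plain DDIM inversion; increasing `k` injects fresh Gaussian noise at the start, which is the decorrelation mechanism the abstract refers to.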