Denoising diffusion models enable conditional generation and density modeling of complex relationships like images and text. However, the nature of the learned relationships is opaque, making it difficult to understand precisely what relationships between words and parts of an image are captured, or to predict the effect of an intervention. We illuminate the fine-grained relationships learned by diffusion models by noting a precise connection between diffusion and information decomposition. Exact expressions for mutual information and conditional mutual information can be written in terms of the denoising model. Furthermore, pointwise estimates are also easy to compute, allowing us to ask questions about the relationships between specific images and captions. Decomposing information even further, to understand which variables in a high-dimensional space carry information, is a long-standing problem. For diffusion models, we show that a natural non-negative decomposition of mutual information emerges, allowing us to quantify informative relationships between words and pixels in an image. We exploit these new relations to measure the compositional understanding of diffusion models, to perform unsupervised localization of objects in images, and to measure effects when selectively editing images through prompt interventions.
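The relation between mutual information and the denoising model can be sketched concretely. Under the standard information-theoretic view of diffusion, I(x; y) equals one half the integral, over signal-to-noise ratios, of the gap between the unconditional and conditional denoising errors. The sketch below is a minimal Monte Carlo illustration of that identity, assuming hypothetical `denoise_cond` / `denoise_uncond` callables that stand in for a trained diffusion model; it is not the paper's actual implementation.

```python
import numpy as np

def pointwise_mi(x, denoise_cond, denoise_uncond, snr_grid, n_samples=64, seed=0):
    """Monte Carlo sketch of a pointwise mutual information estimate:
        i(x; y) ~ 1/2 * integral over SNR gamma of
                  E||x - xhat(z_gamma)||^2 - E||x - xhat(z_gamma, y)||^2,
    where z_gamma = sqrt(gamma/(1+gamma)) * x + sqrt(1/(1+gamma)) * eps is the
    noisy observation of x at SNR gamma. The denoisers are assumed stand-ins
    for a trained (un)conditional diffusion denoiser."""
    rng = np.random.default_rng(seed)
    gaps = []
    for snr in snr_grid:
        gap = 0.0
        for _ in range(n_samples):
            eps = rng.standard_normal(x.shape)
            z = np.sqrt(snr / (1 + snr)) * x + np.sqrt(1 / (1 + snr)) * eps
            gap += np.sum((x - denoise_uncond(z, snr)) ** 2)  # unconditional error
            gap -= np.sum((x - denoise_cond(z, snr)) ** 2)    # conditional error
        gaps.append(gap / n_samples)
    # trapezoidal integration of the error gap over the SNR grid
    gaps = np.asarray(gaps)
    integral = np.sum(0.5 * (gaps[1:] + gaps[:-1]) * np.diff(snr_grid))
    return 0.5 * integral
```

Averaging the squared-error gap per pixel instead of summing it would give the per-pixel information decomposition the abstract refers to; summing recovers the total (conditional) mutual information estimate.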