Denoising diffusion models enable conditional generation and density modeling of complex relationships, such as those between images and text. However, the learned relationships are opaque, making it difficult to understand precisely which relationships between words and parts of an image are captured, or to predict the effect of an intervention. We illuminate the fine-grained relationships learned by diffusion models by noting a precise connection between diffusion and information decomposition. Exact expressions for mutual information and conditional mutual information can be written in terms of the denoising model. Furthermore, pointwise estimates are also easy to compute, allowing us to ask questions about the relationships between specific images and captions. Decomposing information even further, to understand which variables in a high-dimensional space carry information, is a long-standing problem. For diffusion models, we show that a natural non-negative decomposition of mutual information emerges, allowing us to quantify informative relationships between words and pixels in an image. We exploit these new relations to measure the compositional understanding of diffusion models, to perform unsupervised localization of objects in images, and to measure effects when selectively editing images through prompt interventions.
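To make the claim that mutual information can be written in terms of a denoiser more concrete, here is a minimal toy sketch. It is not the paper's method or code. It assumes a jointly Gaussian scalar pair (x, y) with unit variances and correlation `rho`, for which the optimal denoisers and their mean squared errors are known in closed form, so no trained diffusion model is involved. It relies on the classical I-MMSE relation for Gaussian noise channels: the mutual information equals half the integral, over the signal-to-noise ratio, of the gap between the unconditional and the conditional denoising error. The function names, the log-SNR integration grid, and the truncation limits are illustrative assumptions.

```python
import math


def gaussian_mi_closed_form(rho):
    """Exact I(x; y) in nats for unit-variance jointly Gaussian x, y."""
    return -0.5 * math.log(1.0 - rho * rho)


def gaussian_mi_from_mmse_gap(rho, num_steps=200_000,
                              log_snr_min=-20.0, log_snr_max=20.0):
    """Estimate I(x; y) by integrating the denoising MMSE gap over log-SNR.

    Toy setting (an assumption of this sketch): x is observed through a
    Gaussian noise channel z = sqrt(snr) * x + eps, and the optimal
    denoiser either sees z alone or z together with y.
    """
    cond_var = 1.0 - rho * rho  # variance of x given y
    step = (log_snr_max - log_snr_min) / num_steps

    def integrand(alpha):
        snr = math.exp(alpha)
        mmse_uncond = 1.0 / (1.0 + snr)              # error without y
        mmse_cond = cond_var / (1.0 + cond_var * snr)  # error with y
        # Change of variables: d(snr) = snr * d(alpha).
        return 0.5 * (mmse_uncond - mmse_cond) * snr

    # Composite trapezoidal rule on a uniform log-SNR grid.
    total = 0.5 * (integrand(log_snr_min) + integrand(log_snr_max))
    for k in range(1, num_steps):
        total += integrand(log_snr_min + k * step)
    return total * step


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        print(rho, gaussian_mi_from_mmse_gap(rho), gaussian_mi_closed_form(rho))
```

In this toy case the integrand is non-negative at every noise level, because conditioning on y can only reduce the optimal denoising error. With a learned model, the closed-form errors would be replaced by the squared difference between conditional and unconditional denoiser outputs, estimated by sampling.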