We develop theory to understand an intriguing property of diffusion models for image generation that we term critical windows. Empirically, it has been observed that there are narrow time intervals in sampling during which particular features of the final image emerge, e.g. the image class or background color (Ho et al., 2020b; Georgiev et al., 2023; Raya & Ambrogioni, 2023; Sclocchi et al., 2024; Biroli et al., 2024). While this is advantageous for interpretability as it implies one can localize properties of the generation to a small segment of the trajectory, it seems at odds with the continuous nature of the diffusion. We propose a formal framework for studying these windows and show that for data coming from a mixture of strongly log-concave densities, these windows can be provably bounded in terms of certain measures of inter- and intra-group separation. We also instantiate these bounds for concrete examples like well-conditioned Gaussian mixtures. Finally, we use our bounds to give a rigorous interpretation of diffusion models as hierarchical samplers that progressively "decide" output features over a discrete sequence of times. We validate our bounds with synthetic experiments. Additionally, preliminary experiments on Stable Diffusion suggest critical windows may serve as a useful tool for diagnosing fairness and privacy violations in real-world diffusion models.
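As a toy illustration of the Gaussian-mixture case mentioned above, the following sketch tracks how distinguishable two mixture components remain as noise is added. It is a minimal sketch under stated assumptions, not the paper's formal definitions or bounds: it assumes a one-dimensional mixture with components N(+mu, 1) and N(-mu, 1), an Ornstein-Uhlenbeck forward process x_t = e^{-t} x_0 + sqrt(1 - e^{-2t}) z, and a hypothetical threshold `eps` marking where the components count as "separated" or "merged". The function names are illustrative only.

```python
import math


def tv_between_noised_components(mu: float, t: float) -> float:
    """Total variation distance between the two noised components.

    Assumption: under the OU forward process a unit-variance component
    N(+/-mu, 1) becomes N(+/-mu * exp(-t), 1), so the distance has the
    closed form erf(a / sqrt(2)) with a = mu * exp(-t).
    """
    a = mu * math.exp(-t)
    return math.erf(a / math.sqrt(2.0))


def crossing_time(mu: float, level: float, t_max: float = 20.0, iters: int = 100) -> float:
    """Forward time at which the TV distance falls to `level` (bisection).

    Relies on the TV distance being decreasing in t, and on `level` lying
    strictly between its values at t = 0 and t = t_max.
    """
    lo, hi = 0.0, t_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tv_between_noised_components(mu, mid) > level:
            lo = mid  # components still too distinguishable: move later
        else:
            hi = mid
    return 0.5 * (lo + hi)


def critical_window(mu: float, eps: float = 0.05) -> tuple[float, float]:
    """Illustrative window: from TV = 1 - eps (separated) to TV = eps (merged)."""
    start = crossing_time(mu, 1.0 - eps)
    end = crossing_time(mu, eps)
    return start, end


if __name__ == "__main__":
    for mu in (10.0, 100.0):
        start, end = critical_window(mu)
        print(f"mu={mu}: window in forward time [{start:.3f}, {end:.3f}]")
```

In this toy setting the width of the window does not depend on `mu`, while its location shifts by log of the ratio when the separation is scaled. Sampling runs the forward time in reverse, so the component is "decided" as the sampler passes from the end of this interval back to its start.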