Diffusion models have demonstrated remarkable performance in generation tasks. Nevertheless, explaining the diffusion process remains challenging because it consists of a sequence of denoising steps over noisy images that are difficult for experts to interpret. To address this issue, we pose three research questions that interpret the diffusion process in terms of the visual concepts the model generates and the regions the model attends to at each time step. We devise tools that visualize the diffusion process and answer these research questions, making the process human-understandable. Through experiments with various visual analyses using these tools, we show how the output is progressively generated by explaining the level of denoising and highlighting its relationship to foundational visual concepts at each time step. During training, the diffusion model learns diverse visual concepts corresponding to each time step, which enables it to predict different levels of visual concepts at different stages. We validate our tools using the Area Under the Curve (AUC) score, correlation quantification, and cross-attention mapping. Our findings provide insights into the diffusion process and pave the way for further research into explainable diffusion mechanisms.