Recent improvements in the image quality of diffusion models have caused a major shift in generative-model research. Current approaches often fine-tune pre-trained foundation models on domain-specific text-image pairs. This is straightforward for X-ray image generation because radiology reports linked to specific images are widely available. However, these approaches rarely inspect the attention layers to verify whether the models understand what they are generating. In this paper, we identify an important trade-off between image fidelity and interpretability in generative diffusion models. In particular, we show that fine-tuning text-to-image models with a learnable text encoder leads to a lack of interpretability. Finally, we demonstrate the interpretability of diffusion models by showing that keeping the language encoder frozen enables them to achieve state-of-the-art phrase grounding performance on certain diseases in a challenging multi-label segmentation task, without any additional training. Code and models will be available at https://github.com/MischaD/chest-distillation.