As text-to-image models grow increasingly powerful and complex, their burgeoning size presents a significant obstacle to widespread adoption, especially on resource-constrained devices. This paper presents a pioneering study on post-training pruning of Stable Diffusion 2, addressing the critical need for model compression in the text-to-image domain. Our study addresses pruning techniques for previously unexplored multi-modal generation models, and in particular examines the impact of pruning on the textual component and the image generation component separately. We conduct a comprehensive comparison of pruning the full model versus pruning each component individually, across a range of sparsity levels. Our results yield previously undocumented findings. For example, contrary to established trends in language model pruning, we discover that simple magnitude pruning outperforms more advanced techniques in the text-to-image setting. Furthermore, our results show that Stable Diffusion 2 can be pruned to 38.5% sparsity with minimal quality loss, achieving a significant reduction in model size. We propose an optimal pruning configuration that prunes the text encoder to 47.5% sparsity and the diffusion generator to 35%. This configuration maintains image generation quality while substantially reducing computational requirements. In addition, our work uncovers intriguing questions about information encoding in text-to-image models: we observe that pruning beyond certain thresholds leads to sudden performance drops (unreadable images), suggesting that specific weights encode critical semantic information. This finding opens new avenues for future research in model compression, interpretability, and bias identification in text-to-image models. By providing crucial insights into the pruning behavior of text-to-image models, our study lays the groundwork for developing more efficient and accessible AI-driven image generation systems.
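To make the reported configuration concrete, the following is a minimal sketch of magnitude pruning applied per component in PyTorch, assuming the Hugging Face diffusers pipeline for Stable Diffusion 2. The `magnitude_prune` helper, the choice of layer-wise (rather than global) thresholds, and the restriction to Linear and Conv2d layers are illustrative assumptions, not the authors' implementation.

```python
import torch
from torch.nn.utils import prune
from diffusers import StableDiffusionPipeline

def magnitude_prune(module: torch.nn.Module, sparsity: float) -> None:
    """Layer-wise magnitude pruning (an assumed setup, not the paper's exact
    recipe): zero the lowest-|w| fraction of weights in every Linear and
    Conv2d layer of `module`."""
    for sub in module.modules():
        if isinstance(sub, (torch.nn.Linear, torch.nn.Conv2d)):
            prune.l1_unstructured(sub, name="weight", amount=sparsity)
            prune.remove(sub, "weight")  # bake the zero mask into the weights

# Load Stable Diffusion 2 and prune each component to the sparsity
# levels proposed in the abstract.
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2")
magnitude_prune(pipe.text_encoder, sparsity=0.475)  # text encoder: 47.5%
magnitude_prune(pipe.unet, sparsity=0.35)           # diffusion generator: 35%
```

Here `prune.l1_unstructured` ranks weights by absolute value within each layer, matching the "simple magnitude pruning" criterion the abstract highlights, and `prune.remove` makes the zeros permanent so the pruned tensors can be stored or executed sparsely.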