Diffusion models have achieved great success in synthesizing high-quality images. However, generating high-resolution images with them remains challenging because of the enormous computational cost, which results in prohibitive latency for interactive applications. In this paper, we propose DistriFusion to tackle this problem by leveraging parallelism across multiple GPUs. Our method splits the model input into multiple patches and assigns each patch to a GPU. However, naively implementing such an algorithm breaks the interaction between patches and loses fidelity, while incorporating such interaction incurs tremendous communication overhead. To overcome this dilemma, we observe that the inputs at adjacent diffusion steps are highly similar and propose displaced patch parallelism, which takes advantage of the sequential nature of the diffusion process by reusing the pre-computed feature maps from the previous timestep to provide context for the current step. As a result, our method supports asynchronous communication, which can be pipelined with computation. Extensive experiments show that our method can be applied to the recent Stable Diffusion XL with no quality degradation and achieves up to a 6.1$\times$ speedup on eight NVIDIA A100s compared with a single one. Our code is publicly available at https://github.com/mit-han-lab/distrifuser.
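The idea of reusing the previous step's feature maps as context can be illustrated with a minimal NumPy sketch. This is not the DistriFusion implementation: the names `layer` and `displaced_patch_step` are hypothetical, the GPUs are simulated by a plain loop, and the toy layer (a mix of each row with the global mean) merely stands in for an operation that needs information from other patches. Under these assumptions, each simulated device combines fresh activations for its own patch with stale activations for all other patches.

```python
import numpy as np

def layer(full_input, rows):
    """Toy layer with a global receptive field (assumed stand-in for
    attention or convolution): each output row mixes its own input
    with the mean over all rows."""
    return 0.5 * full_input[rows] + 0.5 * full_input.mean(axis=0, keepdims=True)

def displaced_patch_step(x, stale, num_devices):
    """Simulate one step of patch parallelism with stale context.

    x:     current full input, shape (H, C)
    stale: full input from the previous step, shape (H, C)
    Returns the approximate layer output and the buffer to reuse next step.
    """
    patches = np.array_split(np.arange(x.shape[0]), num_devices)
    out = np.empty_like(x)
    for rows in patches:          # each iteration plays the role of one GPU
        context = stale.copy()    # other patches: activations from the previous step
        context[rows] = x[rows]   # own patch: fresh activations
        out[rows] = layer(context, rows)
    # In a real multi-GPU system this exchange would be an asynchronous
    # gather overlapped with computation; here it is just a copy.
    return out, x.copy()

# Adjacent steps are assumed to be similar, so the stale context is a
# close approximation of the fresh one.
rng = np.random.default_rng(0)
x_prev = rng.normal(size=(8, 4))
x_curr = x_prev + 0.01 * rng.normal(size=(8, 4))

approx, _ = displaced_patch_step(x_curr, x_prev, num_devices=4)
exact = layer(x_curr, slice(None))
print("max abs error vs. fully synchronized layer:", np.abs(approx - exact).max())
```

In this sketch, the first step would be run with `stale` equal to `x` (fully synchronized), after which each step only needs the buffer returned by the previous one. That dependency on the previous step rather than the current one is what allows the communication to happen in the background.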