Towards Practical Plug-and-Play Diffusion Models

Diffusion-based generative models have achieved remarkable success in image generation. Their guidance formulation allows an external model to control the generation process in a plug-and-play manner for various tasks, without fine-tuning the diffusion model. However, directly using publicly available off-the-shelf models for guidance fails because these models perform poorly on noisy inputs. The existing remedy is to fine-tune the guidance model on labeled data corrupted with noise. In this paper, we argue that this practice is limited in two respects: (1) handling inputs across a very wide range of noise levels is too hard for a single guidance model; (2) the need to collect labeled datasets hinders scaling to various tasks. To address these limitations, we propose a novel strategy that leverages multiple experts, where each expert specializes in a particular noise range and guides the reverse diffusion process at its corresponding timesteps. Because managing multiple networks and relying on labeled data is impractical, we present a practical guidance framework termed Practical Plug-And-Play (PPAP), which leverages parameter-efficient fine-tuning and data-free knowledge transfer. Extensive ImageNet class-conditional generation experiments show that our method can successfully guide diffusion with few trainable parameters and no labeled data. Finally, we show that image classifiers, depth estimators, and semantic segmentation models can guide the publicly available GLIDE through our framework in a plug-and-play manner. Our code is available at https://github.com/riiid/PPAP.
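The multi-expert idea can be illustrated with a minimal sketch of how a timestep might be routed to the expert responsible for its noise range. This is not the released PPAP implementation: the function name `expert_index`, the uniform partition of timesteps, and the example values (1000 timesteps, 5 experts) are assumptions made purely for illustration.

```python
def expert_index(t: int, num_timesteps: int, num_experts: int) -> int:
    """Return the index of the expert assigned to diffusion timestep t.

    Assumption (not from the paper's abstract): the timesteps
    0 .. num_timesteps - 1 are split into num_experts contiguous
    ranges of (near-)equal size, one per expert.
    """
    if num_experts < 1:
        raise ValueError("num_experts must be at least 1")
    if not 0 <= t < num_timesteps:
        raise ValueError("t must satisfy 0 <= t < num_timesteps")
    # Integer arithmetic maps each contiguous block of timesteps to one expert.
    return t * num_experts // num_timesteps


if __name__ == "__main__":
    # Hypothetical setting: 1000 timesteps shared among 5 experts.
    for t in (0, 199, 200, 999):
        print(t, expert_index(t, num_timesteps=1000, num_experts=5))
```

In a guided sampler, the selected index would pick which expert's gradient steers the reverse step at timestep `t`, so that each expert only ever sees inputs from the noise range it was adapted to.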
