Large-scale diffusion models such as Stable Diffusion are powerful and have many real-world applications, but customizing them by fine-tuning is inefficient in both memory and time. Motivated by recent progress in natural language processing, we investigate parameter-efficient tuning of large diffusion models by inserting small learnable modules (termed adapters). In particular, we decompose the design space of adapters into orthogonal factors: the input position, the output position, and the function form. We then perform Analysis of Variance (ANOVA), a classical statistical approach for analyzing the correlation between discrete variables (design options) and continuous variables (evaluation metrics). Our analysis suggests that the input position of adapters is the critical factor influencing performance on downstream tasks. We then carefully study the choice of input position and find that placing the adapter input after the cross-attention block leads to the best performance, a finding supported by additional visualization analyses. Finally, we provide a recipe for parameter-efficient tuning in diffusion models that is comparable, if not superior, to the fully fine-tuned baseline (e.g., DreamBooth) with only 0.75\% extra parameters, across various customized tasks.
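To make the role of ANOVA concrete, the following is a minimal sketch of a one-way F-statistic computed separately for two design factors. It is not the analysis pipeline used in this work: the function name `one_way_anova_f`, the level names (`in_a`, `out_x`, and so on), and all scores are hypothetical placeholders chosen so the arithmetic is easy to follow, and the real study may use a different ANOVA variant.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F-statistic for a dict {level: [scores]}."""
    all_scores = [s for scores in groups.values() for s in scores]
    grand_mean = mean(all_scores)
    k = len(groups)      # number of levels of the factor
    n = len(all_scores)  # total number of observations
    # Between-group sum of squares: distance of each level mean from the grand mean.
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
    # Within-group sum of squares: spread of scores around their own level mean.
    ss_within = sum((s - mean(v)) ** 2 for v in groups.values() for s in v)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical 3x3 grid of scores (NOT results from the paper):
# each key is (input position, output position), each value a metric score.
scores = {
    ("in_a", "out_x"): 1.0, ("in_a", "out_y"): 2.0, ("in_a", "out_z"): 3.0,
    ("in_b", "out_x"): 2.0, ("in_b", "out_y"): 3.0, ("in_b", "out_z"): 4.0,
    ("in_c", "out_x"): 6.0, ("in_c", "out_y"): 7.0, ("in_c", "out_z"): 8.0,
}


def group_by(factor_index):
    """Group the scores by one factor (0 = input position, 1 = output position)."""
    groups = {}
    for key, score in scores.items():
        groups.setdefault(key[factor_index], []).append(score)
    return groups


f_input = one_way_anova_f(group_by(0))
f_output = one_way_anova_f(group_by(1))
print(f"F(input) = {f_input:.2f}, F(output) = {f_output:.2f}")
# prints: F(input) = 21.00, F(output) = 0.43
```

In this toy grid the same nine scores are grouped twice, once per factor. The much larger F-statistic for the input position means that factor accounts for most of the variance in the metric, which is the kind of evidence used to single out a critical design factor.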