The success of text-guided diffusion models has inspired the development and release of numerous powerful diffusion models within the open-source community. These models are typically fine-tuned on various expert datasets, showcasing diverse denoising capabilities. Leveraging multiple high-quality models to achieve stronger generation ability is valuable, but has not been extensively studied. Existing methods primarily adopt parameter merging strategies to produce a new static model. However, they overlook the fact that the divergent denoising capabilities of the models may change dynamically across different states, such as different prompts, initial noises, denoising steps, and spatial locations. In this paper, we propose a novel ensembling method, Adaptive Feature Aggregation (AFA), which dynamically adjusts the contributions of multiple models at the feature level according to various states (i.e., prompts, initial noises, denoising steps, and spatial locations), thereby keeping the advantages of multiple diffusion models while suppressing their disadvantages. Specifically, we design a lightweight Spatial-Aware Block-Wise (SABW) feature aggregator that adaptively aggregates the block-wise intermediate features from multiple U-Net denoisers into a unified one. The core idea lies in dynamically producing an individual attention map for each model's features by comprehensively considering various states. It is worth noting that only SABW is trainable, with about 50 million parameters, while the other models are frozen. Both quantitative and qualitative experiments demonstrate the effectiveness of our proposed Adaptive Feature Aggregation method. The code is available at https://github.com/tenvence/afa/.
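The aggregation step described above can be sketched in a few lines: at each spatial location, a softmax over the per-model attention scores yields weights that blend the models' block-wise features into a unified one. This is a minimal numpy sketch, not the paper's actual implementation; the function names are illustrative, and in practice the attention logits would be produced by the trainable SABW network conditioned on the prompt, initial noise, and denoising step.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sabw_aggregate(feats, attn_logits):
    """Blend block-wise features from K frozen denoisers.

    feats:       (K, C, H, W) intermediate features, one per model.
    attn_logits: (K, H, W) per-model spatial scores; in the paper these
                 would come from the trainable SABW aggregator, which
                 conditions on prompt, noise, and timestep (assumed here).
    Returns a unified (C, H, W) feature map.
    """
    attn = softmax(attn_logits, axis=0)        # weights over K models per location
    return (attn[:, None] * feats).sum(axis=0) # spatially adaptive weighted sum
```

With uniform logits this reduces to averaging the models' features; as the logits diverge, each spatial location can lean on whichever model denoises it best.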