We introduce X-Adapter, a universal upgrader that enables pretrained plug-and-play modules (e.g., ControlNet, LoRA) to work directly with an upgraded text-to-image diffusion model (e.g., SDXL) without further retraining. We achieve this by training an additional network that controls the frozen upgraded model using new text-image data pairs. Specifically, X-Adapter keeps a frozen copy of the old model to preserve the connectors of different plugins. It also adds trainable mapping layers that bridge the decoders of the two model versions for feature remapping, and the remapped features serve as guidance for the upgraded model. To enhance the guidance ability of X-Adapter, we employ a null-text training strategy for the upgraded model. After training, we also introduce a two-stage denoising strategy to align the initial latents of X-Adapter and the upgraded model. With these strategies, X-Adapter demonstrates universal compatibility with various plugins and also enables plugins of different versions to work together, thereby expanding the functionality available to the diffusion community. To verify the effectiveness of the proposed method, we conduct extensive experiments, and the results show that X-Adapter may facilitate wider application in upgraded foundational diffusion models.
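The feature remapping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a mapping layer is a single per-position linear map from old-model channels to upgraded-model channels, that both feature maps have the same number of spatial positions, and that guidance is applied by scaled addition. The names `remap`, `guide`, and `scale` are hypothetical and chosen only for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def remap(features_old: List[Vector], weight: Matrix, bias: Vector) -> List[Vector]:
    """Map each old-model feature vector into the upgraded model's channel space.

    ASSUMPTION: one linear layer (weight, bias) stands in for a mapping layer.
    `weight` has one row per output channel and one column per input channel.
    """
    remapped = []
    for feat in features_old:
        if len(feat) != len(weight[0]):
            raise ValueError("feature width does not match weight columns")
        remapped.append([
            sum(w * x for w, x in zip(row, feat)) + b
            for row, b in zip(weight, bias)
        ])
    return remapped


def guide(features_new: List[Vector], remapped: List[Vector], scale: float) -> List[Vector]:
    """Add scaled remapped features to the upgraded model's decoder features.

    ASSUMPTION: guidance is injected by addition; `scale` is a hypothetical
    guidance weight, not a parameter named in the source.
    """
    return [
        [n + scale * r for n, r in zip(f_new, f_map)]
        for f_new, f_map in zip(features_new, remapped)
    ]


# Toy example: 2 spatial positions, 2 old channels mapped to 3 new channels.
old_feats = [[1.0, 2.0], [0.0, -1.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]
new_feats = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]

guided = guide(new_feats, remap(old_feats, weight, bias), scale=0.5)
print(guided)
```

In this sketch only `weight` and `bias` would be updated during training, while the features from both backbones are treated as outputs of frozen models, which mirrors the division between frozen and trainable parts stated in the abstract.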