We propose an effective method for inserting adapters into text-to-image foundation models that enables complex downstream tasks while preserving the generalization ability of the base model. The core idea is to optimize the attention mechanisms operating on 2D feature maps, which improves adapter performance. We validate this approach on the task of meme video generation, where it achieves strong results. We hope this work offers insights for post-training of large text-to-image models. Moreover, because the method is well compatible with SD1.5 derivative models, it is of value to the open-source community; we therefore release the related code (\url{https://songkey.github.io/hellomeme}).