Recent work has demonstrated a remarkable ability to customize text-to-image diffusion models to multiple, fine-grained concepts in a sequential (i.e., continual) manner while only providing a few example images for each concept. This setting is known as continual diffusion. Here, we ask the question: Can we scale these methods to longer concept sequences without forgetting? Although prior work mitigates the forgetting of previously learned concepts, we show that its capacity to learn new tasks reaches saturation over longer sequences. We address this challenge by introducing a novel method, STack-And-Mask INcremental Adapters (STAMINA), which is composed of low-rank attention-masked adapters and customized MLP tokens. STAMINA is designed to enhance the robust fine-tuning properties of LoRA for sequential concept learning via learnable hard-attention masks parameterized with low-rank MLPs, enabling precise, scalable learning via sparse adaptation. Notably, all introduced trainable parameters can be folded back into the model after training, inducing no additional inference parameter costs. We show that STAMINA outperforms the prior SOTA for the setting of text-to-image continual customization on a 50-concept benchmark composed of landmarks and human faces, with no stored replay data. Additionally, we extend our method to the setting of continual learning for image classification, demonstrating that our gains also translate to state-of-the-art performance on this standard benchmark.
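To illustrate the claim that masked low-rank adapters can be folded back into the base weights, the following is a minimal NumPy sketch, not the exact STAMINA formulation. It assumes a single linear layer, a hard mask obtained by thresholding a low-rank logit matrix at zero, and hypothetical names (`hard_mask`, `masked_lora_delta`, `fold`); the learned mask parameterization, the MLP tokens, and the training procedure are omitted.

```python
import numpy as np


def hard_mask(mask_a, mask_b):
    """Binary mask from low-rank factors (assumption: threshold logits at 0)."""
    logits = mask_a @ mask_b
    return (logits > 0).astype(logits.dtype)


def masked_lora_delta(lora_a, lora_b, mask_a, mask_b):
    """Sparse weight update: hard mask applied elementwise to the LoRA product."""
    return hard_mask(mask_a, mask_b) * (lora_a @ lora_b)


def fold(weight, deltas):
    """Merge all per-concept updates into one matrix of the original shape."""
    return weight + sum(deltas)


rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 8, 2  # illustrative sizes only

weight = rng.normal(size=(d_out, d_in))

# One masked adapter per sequentially learned concept (two concepts here).
deltas = [
    masked_lora_delta(
        rng.normal(size=(d_out, rank)),  # LoRA factor A
        rng.normal(size=(rank, d_in)),   # LoRA factor B
        rng.normal(size=(d_out, rank)),  # mask factor (stand-in for the low-rank MLP)
        rng.normal(size=(rank, d_in)),   # mask factor
    )
    for _ in range(2)
]

x = rng.normal(size=(d_in,))

# Output with the adapters kept as separate branches.
adapter_out = weight @ x + sum(delta @ x for delta in deltas)

# Output after folding: a single matrix, no extra inference parameters.
folded = fold(weight, deltas)
folded_out = folded @ x

print(np.allclose(adapter_out, folded_out))
```

Because each masked update has the same shape as the base weight, stacking adapters for further concepts leaves the folded parameter count unchanged under these assumptions.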