Recent advancements in text-to-image diffusion models have made it possible to personalize these models to generate custom images from textual prompts. This paper presents an efficient LoRA-based personalization approach for on-device subject-driven generation, in which pre-trained diffusion models are fine-tuned with user-specific data on resource-constrained devices. Our method, termed Hollowed Net, enhances memory efficiency during fine-tuning by modifying the architecture of a diffusion U-Net to temporarily remove a fraction of its deep layers, creating a hollowed structure. This approach directly addresses on-device memory constraints and substantially reduces GPU memory requirements for training, in contrast to previous methods that primarily focus on minimizing training steps and reducing the number of parameters to update. Additionally, the personalized Hollowed Net can be transferred back into the original U-Net, enabling inference without additional memory overhead. Quantitative and qualitative analyses demonstrate that our approach not only reduces training memory to levels as low as those required for inference but also maintains or improves personalization performance compared to existing methods.
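The hollowing-and-restoring idea described above can be sketched structurally. The snippet below is a minimal illustration, not the paper's actual implementation: it models U-Net blocks as a plain list, temporarily removes the deepest (middle) fraction for training, and later splices them back so inference runs on the full network. The `hollow` and `restore` helpers and the block names are hypothetical.

```python
# Hypothetical sketch of Hollowed Net's structural idea: temporarily drop
# a fraction of the deepest U-Net blocks during LoRA fine-tuning, then
# restore them so inference uses the original architecture. Blocks are
# modeled as a list; real code would operate on nn.Module children.

def hollow(blocks, fraction=0.5):
    """Remove the deepest `fraction` of blocks; return the hollowed
    network, the removed blocks, and the span needed to restore them."""
    n_remove = int(len(blocks) * fraction)
    # In a U-Net, the deepest layers sit mid-way along the
    # encoder-decoder path, so we cut a centered span.
    mid = len(blocks) // 2
    lo, hi = mid - n_remove // 2, mid + (n_remove - n_remove // 2)
    removed = blocks[lo:hi]
    hollowed = blocks[:lo] + blocks[hi:]
    return hollowed, removed, (lo, hi)

def restore(hollowed, removed, span):
    """Splice the removed blocks back into their original position."""
    lo, _ = span
    return hollowed[:lo] + removed + hollowed[lo:]

blocks = [f"block_{i}" for i in range(8)]   # stand-in for U-Net blocks
hollowed, removed, span = hollow(blocks, fraction=0.5)
assert len(hollowed) == 4                   # fine-tuning holds fewer layers in memory
full = restore(hollowed, removed, span)
assert full == blocks                       # inference sees the original network
```

Because only the shallow layers remain resident during fine-tuning, activation and optimizer memory shrink accordingly; the LoRA adapters trained on the hollowed network are what get transferred back for full-network inference.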