A highly promising route to flexible, real-time, on-device image editing is data distillation: large-scale text-to-image diffusion models generate paired datasets, which are then used to train generative adversarial networks (GANs). This approach considerably relaxes the dependence on high-end commercial GPUs that image editing with diffusion models typically requires. However, unlike a text-to-image diffusion model, each distilled GAN is specialized for a single image editing task, so obtaining models for many concepts requires costly, repeated training. In this work, we introduce and address a new research question: can the distillation of GANs from diffusion models be made significantly more efficient? To this end, we propose a series of techniques. First, we construct a base GAN model with generalized features that can be adapted to different concepts through fine-tuning, which eliminates the need to train from scratch. Second, rather than fine-tuning the entire base model, we identify its crucial layers and apply Low-Rank Adaptation (LoRA) to them, with a simple yet effective rank search process. Third, we investigate the minimal amount of data needed for fine-tuning, which further reduces the overall training time. Extensive experiments show that our approach efficiently equips GANs to perform real-time, high-quality image editing on mobile devices, with remarkably reduced training and storage costs per concept.
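The LoRA step and the per-concept storage saving can be illustrated with a minimal plain-Python sketch. This is not the authors' implementation. The merge rule W' = W + (alpha / r) * B A follows the standard LoRA formulation. Treating a k x k convolution as a C_out x (C_in * k * k) weight matrix is an assumption made here for the parameter count. `search_rank`, its `score` callable and its `target` threshold are hypothetical stand-ins for the rank search, because the text does not specify the criterion that search uses.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain-Python matrix product; x is (n, k) and y is (k, m)."""
    return [
        [sum(x[i][t] * y[t][j] for t in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def lora_merge(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, with B of shape (d_out, r) and A of shape (r, d_in).

    Only B and A are trained and stored per concept; the base weight W stays frozen.
    """
    r = len(a)  # the adapter rank is the number of rows of A
    delta = matmul(b, a)
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))] for i in range(len(w))]


def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters of a rank-r adapter (full fine-tuning would train d_out * d_in)."""
    return r * (d_out + d_in)


def search_rank(candidates: Sequence[int], score: Callable[[int], float], target: float) -> int:
    """Hypothetical rank search: return the smallest candidate rank whose score reaches target.

    `score` stands in for whatever quality measure is evaluated after fine-tuning at rank r.
    If no candidate reaches the target, fall back to the largest candidate rank.
    """
    for r in sorted(candidates):
        if score(r) >= target:
            return r
    return max(candidates)


if __name__ == "__main__":
    # Toy 2x2 layer with a rank-1 adapter.
    merged = lora_merge([[1.0, 0.0], [0.0, 1.0]], [[1.0], [2.0]], [[0.5, -0.5]], alpha=1.0)
    print(merged)  # [[1.5, -0.5], [1.0, 0.0]]

    # Assumed 3x3 conv with 256 input and 256 output channels, viewed as a 256 x 2304 matrix.
    full = 256 * 256 * 9
    adapter = lora_param_count(256, 256 * 9, r=8)
    print(full, adapter)  # 589824 20480

    # Synthetic score curve (not measured data) that rises with rank.
    print(search_rank([8, 1, 4, 2], lambda r: 1 - 1 / (r + 1), target=0.75))  # 4
```

In this toy setting, a rank-8 adapter on the assumed convolution stores about 3.5% of the layer's parameters. This is the kind of saving that makes keeping one small adapter per concept cheaper than storing a fully fine-tuned copy of the layer.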