Diffusion models have achieved remarkable success in text-to-image generation, enabling the creation of high-quality images from text prompts or other modalities. However, existing methods for customizing these models struggle to handle multiple personalized subjects and are prone to overfitting. Moreover, their large number of parameters makes model storage inefficient. In this paper, we propose a novel approach to address these limitations when personalizing text-to-image diffusion models. Our method fine-tunes the singular values of the weight matrices, yielding a compact and efficient parameter space that reduces the risk of overfitting and language drift. We also propose Cut-Mix-Unmix, a data-augmentation technique that improves the quality of multi-subject image generation, and a simple text-based image editing framework. Our proposed SVDiff method has a significantly smaller model size (1.7MB for StableDiffusion) than existing methods (vanilla DreamBooth 3.66GB, Custom Diffusion 73MB), making it more practical for real-world applications.
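To make the compactness claim concrete, the following is a rough sketch of the underlying arithmetic rather than the authors' implementation. A weight matrix with `rows` by `cols` entries has only `min(rows, cols)` singular values, so training the singular values alone shrinks the number of stored values per layer by roughly a factor of `max(rows, cols)`. The function names and the layer shapes are hypothetical, chosen only for illustration, and do not come from any real model.

```python
def full_finetune_params(rows: int, cols: int) -> int:
    """Trainable values when every entry of a rows x cols weight matrix is updated."""
    return rows * cols


def singular_value_params(rows: int, cols: int) -> int:
    """Trainable values when only the singular values are updated.

    A rows x cols matrix has min(rows, cols) singular values.
    """
    return min(rows, cols)


# Hypothetical layer shapes, for illustration only (not taken from a real model).
layer_shapes = [(320, 320), (640, 320), (1280, 768)]

full = sum(full_finetune_params(r, c) for r, c in layer_shapes)
spectral = sum(singular_value_params(r, c) for r, c in layer_shapes)

print(f"full fine-tuning:     {full} values")
print(f"singular values only: {spectral} values")
print(f"reduction factor:     {full / spectral:.0f}x")
```

The same counting applies to any layer that can be viewed as a 2D matrix; how higher-dimensional weights are reshaped before decomposition is a design choice this sketch does not cover.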