Diffusion models have achieved remarkable success in text-to-image generation, producing high-quality images from text prompts or other modalities. However, existing methods for customizing these models struggle to handle multiple personalized subjects and are prone to overfitting. Moreover, they fine-tune a large number of parameters, which makes storing each customized model inefficient. In this paper, we propose a novel personalization approach for text-to-image diffusion models that addresses these limitations. Our method fine-tunes the singular values of the weight matrices, yielding a compact and efficient parameter space that reduces the risk of overfitting and language drift. We also propose a Cut-Mix-Unmix data-augmentation technique to improve the quality of multi-subject image generation, as well as a simple text-based image editing framework. Our proposed SVDiff method has a significantly smaller model size than existing methods (approximately 2,200 times fewer parameters than vanilla DreamBooth), making it more practical for real-world applications.
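
To make the idea of fine-tuning singular values concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single 2-D weight matrix, keeps the singular vectors frozen, and treats a per-singular-value shift `delta` as the only trainable quantity. The helper names (`decompose`, `apply_shift`), the toy matrix shape, and the clamp that keeps shifted singular values non-negative are all illustrative assumptions.

```python
import numpy as np


def decompose(weight):
    """Return the thin SVD factors (U, sigma, Vt) of a 2-D weight matrix."""
    u, sigma, vt = np.linalg.svd(weight, full_matrices=False)
    return u, sigma, vt


def apply_shift(u, sigma, vt, delta):
    """Rebuild a weight matrix from frozen U, Vt and shifted singular values."""
    # Assumption: clamp at zero so the shifted values remain valid singular values.
    shifted = np.maximum(sigma + delta, 0.0)
    # Scaling the columns of U by `shifted` is equivalent to U @ diag(shifted).
    return (u * shifted) @ vt


# Toy stand-in for one pretrained weight matrix (hypothetical shape).
rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 32))

u, sigma, vt = decompose(weight)

# The shift is the only trainable quantity; zero reproduces the original weight.
delta = np.zeros_like(sigma)
restored = apply_shift(u, sigma, vt, delta)

full_params = weight.size   # parameters updated by full fine-tuning of this matrix
shift_params = delta.size   # parameters updated when only singular values move
```

For this toy matrix, the shift vector holds `min(64, 32) = 32` values against 2,048 entries in the full matrix. That ratio is specific to the toy shape and is not meant to reproduce the model-size figure reported above.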