The ability to fine-tune generative models for text-to-image generation is crucial, particularly given the complexity of accurately interpreting and visualizing textual inputs. While LoRA is efficient for adapting language models, it often falls short in text-to-image tasks, where image generation must accommodate a broad spectrum of styles and nuances. To bridge this gap, we introduce StyleInject, a fine-tuning approach tailored to text-to-image models. StyleInject comprises multiple parallel low-rank parameter matrices, which preserve the diversity of visual features. It adapts dynamically to varying styles by adjusting the variance of visual features according to the characteristics of the input signal. This design substantially reduces the impact on the original model's text-image alignment capabilities while still adapting to a variety of styles in transfer learning. StyleInject proves particularly effective in learning from and enhancing a range of advanced, community-fine-tuned generative models. Our comprehensive experiments, covering small-sample fine-tuning, large-scale fine-tuning, and base model distillation, show that StyleInject surpasses traditional LoRA in both text-image semantic consistency and human preference evaluation while offering greater parameter efficiency.
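As a rough illustration of the two ingredients named above (parallel low-rank branches and an input-dependent variance adjustment), the following minimal sketch may help. It is not the StyleInject formulation, which the abstract does not specify: the class `ParallelLowRankAdapter`, the helper `match_std`, the averaging of branches, and the choice to rescale the update to the input's standard deviation are all assumptions made for illustration only.

```python
import random
import statistics


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def match_std(values, target_std, eps=1e-8):
    """Rescale values around their mean so their std equals target_std.

    Hypothetical stand-in for "adjusting the variance of visual features".
    A (near-)constant vector is returned unchanged.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std < eps:
        return list(values)
    return [mean + (v - mean) * target_std / std for v in values]


class ParallelLowRankAdapter:
    """Frozen linear map plus several parallel low-rank updates (illustrative)."""

    def __init__(self, weight, rank=2, branches=3, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, shape d_out x d_in
        # Per branch: a small random down-projection (rank x d_in) and a
        # zero-initialised up-projection (d_out x rank), as in LoRA.
        self.down = [[[rng.gauss(0.0, 0.02) for _ in range(d_in)]
                      for _ in range(rank)] for _ in range(branches)]
        self.up = [[[0.0] * rank for _ in range(d_out)]
                   for _ in range(branches)]

    def styled_update(self, x):
        """Average the branch updates, then match their std to the input's std."""
        total = [0.0] * len(self.weight)
        for down, up in zip(self.down, self.up):
            branch = matvec(up, matvec(down, x))
            total = [t + b for t, b in zip(total, branch)]
        mean_update = [t / len(self.down) for t in total]
        # Assumption: the input's own spread sets the scale of the update.
        return match_std(mean_update, statistics.pstdev(x))

    def forward(self, x):
        base = matvec(self.weight, x)
        return [b + u for b, u in zip(base, self.styled_update(x))]


frozen = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
layer = ParallelLowRankAdapter(frozen, rank=2, branches=3, seed=0)
features = [1.0, 2.0, 4.0]
output = layer.forward(features)  # equals the frozen output before training
```

Because every up-projection starts at zero, the adapted layer initially reproduces the frozen model's output, which mirrors the usual LoRA initialisation. Once the branches are trained, the spread of the update in this sketch follows the spread of the input features.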