Low-Rank Adaptation (LoRA) has emerged as a widely adopted technique in text-to-image models, enabling precise rendering of multiple distinct elements, such as characters and styles, in multi-concept image generation. However, current approaches face significant challenges when composing these LoRAs for multi-concept image generation, resulting in diminished quality of the generated images. In this paper, we first investigate the role of LoRAs in the denoising process through the lens of the Fourier frequency domain. Motivated by the hypothesis that applying multiple LoRAs can induce "semantic conflicts", we find that certain LoRAs amplify high-frequency features such as edges and textures, whereas others mainly contribute low-frequency elements, including the overall structure and smooth color gradients. Building on these insights, we devise a frequency-domain sequencing strategy that determines the optimal order in which LoRAs should be integrated during inference. This strategy offers a methodical and generalizable solution compared to the naive integration commonly found in existing LoRA fusion techniques. To fully leverage the proposed LoRA ordering strategy in multi-LoRA composition tasks, we introduce a novel, training-free framework, Cached Multi-LoRA (CMLoRA), designed to integrate multiple LoRAs efficiently while maintaining cohesive image generation. With its flexible backbone for multi-LoRA fusion and a non-uniform caching strategy tailored to individual LoRAs, CMLoRA reduces semantic conflicts in LoRA composition and improves computational efficiency. Our experimental evaluations demonstrate that CMLoRA outperforms state-of-the-art training-free LoRA fusion methods by a significant margin: it achieves an average improvement of $2.19\%$ in CLIPScore and $11.25\%$ in MLLM win rate compared to LoraHub, LoRA Composite, and LoRA Switch.
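To make the frequency-domain intuition concrete, the following is a minimal sketch (not the paper's actual implementation) of how one might score a LoRA's frequency profile and derive an integration order. It assumes access to a representative 2D feature map per LoRA; the helper names (`high_freq_energy_ratio`, `order_loras_by_frequency`) and the radial-cutoff heuristic are illustrative choices, not CMLoRA's exact method.

```python
import numpy as np

def high_freq_energy_ratio(feature_map: np.ndarray, cutoff: float = 0.25) -> float:
    """Fraction of spectral energy outside a centered low-frequency disk.

    `cutoff` is the disk radius as a fraction of the smaller spatial
    dimension; this radial split is a simplifying assumption.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))
    power = np.abs(spectrum) ** 2
    h, w = feature_map.shape
    cy, cx = h // 2, w // 2
    yy, xx = np.ogrid[:h, :w]
    radius = cutoff * min(h, w)
    low_mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    total = power.sum()
    return float(power[~low_mask].sum() / total) if total > 0 else 0.0

def order_loras_by_frequency(feature_maps: dict) -> list:
    """Sort LoRA names so high-frequency-dominant adapters come first.

    Placing edge/texture-oriented LoRAs before structure-oriented ones is
    one plausible ordering criterion under the paper's observations.
    """
    return sorted(
        feature_maps,
        key=lambda name: high_freq_energy_ratio(feature_maps[name]),
        reverse=True,
    )
```

As a sanity check, a checkerboard pattern (energy concentrated near the Nyquist frequency) scores higher than a smooth gradient (energy concentrated near DC), so an edge-like LoRA would be sequenced before a style-like one.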