This paper proposes FreeFuse, a training-free framework for multi-subject text-to-image generation that automatically fuses multiple subject LoRAs. In contrast to prior studies that focus on retraining LoRAs to alleviate feature conflicts, our analysis reveals that spatially confining each subject LoRA's output to its target region, and preventing other LoRAs from directly intruding into that region, is sufficient to mitigate these conflicts. Accordingly, we implement Adaptive Token-Level Routing during the inference phase. We introduce FreeFuseAttn, a mechanism that exploits the flow matching model's intrinsic semantic alignment to dynamically match subject-specific tokens to their corresponding spatial regions at early denoising timesteps, thereby removing the need for external segmentors. FreeFuse is highly practical: it requires no additional training, no model modifications, and no user-defined masks or spatial conditions. Users need only provide subject activation words, so the method integrates seamlessly into standard workflows. Extensive experiments show that FreeFuse outperforms existing approaches in both identity preservation and compositional fidelity. Our code is available at https://github.com/yaoliliu/FreeFuse.
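To make the routing idea concrete, the following is a minimal NumPy sketch of spatially confining each subject LoRA's output to its own region. It is an illustration under stated assumptions, not the FreeFuse implementation: the function names `subject_masks` and `routed_lora_output`, the argmax assignment rule, the `min_score` threshold, and the use of a single linear layer are all hypothetical choices made here for clarity. The sketch assumes an attention map between image tokens and text tokens is already available from an early denoising step.

```python
import numpy as np

def subject_masks(attn, subject_token_ids, min_score=0.0):
    """Assign each image token to at most one subject (hypothetical rule).

    attn: (num_image_tokens, num_text_tokens) attention weights.
    subject_token_ids: one list of text-token indices per subject,
        e.g. the positions of each subject's activation words.
    Returns a boolean array of shape (num_subjects, num_image_tokens).
    """
    # Mean attention from every image token to each subject's tokens: (S, N).
    scores = np.stack([attn[:, ids].mean(axis=1) for ids in subject_token_ids])
    winner = scores.argmax(axis=0)              # best subject per image token
    strong = scores.max(axis=0) >= min_score    # optionally leave weak tokens unassigned
    subject_index = np.arange(len(subject_token_ids))[:, None]
    return (subject_index == winner[None, :]) & strong[None, :]

def routed_lora_output(x, base_w, loras, masks):
    """Apply each LoRA delta only inside its subject's mask.

    x: (N, d_in) image-token features; base_w: (d_in, d_out) frozen weight.
    loras: list of (A, B) pairs with A: (d_in, r) and B: (r, d_out).
    masks: (S, N) boolean array from subject_masks.
    """
    out = x @ base_w
    for (a, b), m in zip(loras, masks):
        # Tokens outside the mask receive no contribution from this LoRA.
        out = out + m[:, None] * (x @ a @ b)
    return out

# Toy usage: 4 image tokens, 3 text tokens, 2 subjects.
attn = np.array([[0.9, 0.1, 0.0],
                 [0.2, 0.8, 0.0],
                 [0.6, 0.3, 0.1],
                 [0.1, 0.7, 0.2]])
masks = subject_masks(attn, [[0], [1]])
x = np.eye(4)
base_w = np.zeros((4, 2))
loras = [(np.ones((4, 1)), np.array([[1.0, 0.0]])),
         (np.ones((4, 1)), np.array([[0.0, 2.0]]))]
out = routed_lora_output(x, base_w, loras, masks)
```

In this toy setup each image token is touched by exactly one LoRA, so the two subject deltas never overlap. A real pipeline would also need to handle background tokens and decide at which layers and timesteps the masks are computed, which this sketch does not attempt to model.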