Transfer learning of diffusion models to smaller target domains is challenging, as naively fine-tuning the model often results in poor generalization. Test-time guidance methods help mitigate this by offering controllable improvements in image fidelity through a trade-off with sample diversity. However, this benefit comes at a high computational cost, typically requiring dual forward passes during sampling. We propose the Domain-guided Fine-tuning (DogFit) method, an effective guidance mechanism for diffusion transfer learning that maintains controllability without incurring additional computational overhead. DogFit injects a domain-aware guidance offset into the training loss, effectively internalizing the guided behavior during the fine-tuning process. The domain-aware design is motivated by our observation that during fine-tuning, the unconditional source model offers a stronger marginal estimate than the target model. To support efficient controllable fidelity-diversity trade-offs at inference, we encode the guidance strength value as an additional model input through a lightweight conditioning mechanism. We further investigate the optimal placement and timing of the guidance offset during training and propose two simple scheduling strategies, i.e., late-start and cut-off, which improve generation quality and training stability. Experiments on DiT and SiT backbones across six diverse target domains show that DogFit can outperform prior guidance methods in transfer learning in terms of FID and FD-DINOv2 while requiring up to 2x fewer sampling TFLOPS.
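The two core ideas, a guidance offset baked into the training target and a timestep-gated schedule (late-start / cut-off), can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the function names, the epsilon-prediction form of the target, and the assumption that the schedules gate the offset over the (normalized) diffusion timestep are all our own simplifications.

```python
def dogfit_target(eps, eps_cond_target, eps_uncond_source, w):
    """Hypothetical guided training target: the true noise plus a
    domain-aware offset scaled by guidance strength w. The offset
    contrasts the conditional target-model estimate against the
    unconditional *source* model, which the abstract observes gives
    a stronger marginal estimate during fine-tuning."""
    return [e + w * (c - u)
            for e, c, u in zip(eps, eps_cond_target, eps_uncond_source)]


def guidance_weight(t, late_start=0.2, cut_off=0.8, w=1.5):
    """Hypothetical schedule: the offset is active only for normalized
    timesteps t in [late_start, cut_off]; outside that window the loss
    reduces to the ordinary fine-tuning objective (w = 0). The specific
    thresholds here are illustrative placeholders."""
    return w if late_start <= t <= cut_off else 0.0
```

At inference, since the guided behavior is internalized and the strength `w` is an extra model input, a single forward pass per sampling step suffices, which is where the claimed savings over dual-pass guidance come from.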