Text-to-image generation models represent the next step in the evolution of image synthesis, offering a natural way to achieve flexible yet fine-grained control over the result. One emerging area of research is the fast adaptation of large text-to-image models to smaller datasets or new visual concepts. However, many efficient adaptation methods take a long time to train, which limits their practical applications, slows down experiments, and consumes excessive GPU resources. In this work, we study the training dynamics of popular text-to-image personalization methods (such as Textual Inversion or DreamBooth), aiming to speed them up. We observe that most concepts are learned at early stages and do not improve in quality later, but standard training convergence metrics fail to indicate this. Instead, we propose a simple drop-in early stopping criterion that only requires computing the regular training objective on a fixed set of inputs at every training iteration. Our experiments on Stable Diffusion for 48 different concepts and three personalization methods demonstrate the competitive performance of our approach, which makes adaptation up to 8 times faster with no significant drops in quality.
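As a rough illustration of the early stopping idea described in the abstract, the sketch below evaluates the training objective on a fixed set of inputs after every step and stops once that value stops changing. The abstract does not state the exact stopping rule, so the variance-ratio test used here, the class name `FixedBatchEarlyStopper`, and the parameters `window` and `ratio` are all assumptions made for illustration. In a real personalization run, `fixed_loss_fn` would compute the usual denoising loss on the same images, noise samples, and timesteps each time, without gradient updates.

```python
from collections import deque
from statistics import pvariance
from typing import Callable, Deque, List


class FixedBatchEarlyStopper:
    """Hypothetical helper: tracks the objective measured on a fixed set of inputs.

    Assumed rule (not taken from the abstract): stop when the variance of the
    last `window` values drops below `ratio` times the variance of all values.
    """

    def __init__(self, window: int = 50, ratio: float = 0.15) -> None:
        if window < 2:
            raise ValueError("window must be at least 2")
        self.window = window
        self.ratio = ratio
        self.history: List[float] = []
        self.recent: Deque[float] = deque(maxlen=window)

    def update(self, fixed_loss: float) -> bool:
        """Record one value of the fixed-input objective; return True to stop."""
        self.history.append(fixed_loss)
        self.recent.append(fixed_loss)
        # Wait until there is enough history to compare against.
        if len(self.history) < 2 * self.window:
            return False
        global_var = pvariance(self.history)
        if global_var == 0.0:
            # The objective never changed, so further training is not informative.
            return True
        return pvariance(self.recent) < self.ratio * global_var


def train_with_early_stopping(
    train_step: Callable[[], None],
    fixed_loss_fn: Callable[[], float],
    max_steps: int,
    stopper: FixedBatchEarlyStopper,
) -> int:
    """Run training steps until the stopper fires; return the number of steps run."""
    for step in range(1, max_steps + 1):
        train_step()  # one ordinary optimization step on a random batch
        if stopper.update(fixed_loss_fn()):  # same inputs at every iteration
            return step
    return max_steps
```

Because the inputs are held fixed, the tracked value is free of the sampling noise that dominates the ordinary per-batch training loss, which is what makes a simple plateau test usable here.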