Text-to-image generation models represent the next step in the evolution of image synthesis, offering a natural means of flexible yet fine-grained control over the result. One emerging area of research is the rapid adaptation of large text-to-image models to smaller datasets or new visual concepts. However, the most efficient method of adaptation, called textual inversion, is known to require long training times, which both restricts practical applications and increases the experiment time for research. In this work, we study the training dynamics of textual inversion, aiming to speed it up. We observe that most concepts are learned at early stages and do not improve in quality later, but standard model convergence metrics fail to reveal this. To address this, we propose a simple early stopping criterion that only requires computing the textual inversion loss on the same inputs for all training iterations. Our experiments on both Latent Diffusion and Stable Diffusion models for 93 concepts demonstrate the competitive performance of our method, speeding up adaptation by up to 15 times with no significant drop in quality.
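The early stopping idea described in the abstract can be illustrated with a minimal sketch. The abstract states only that the criterion computes the textual inversion loss on the same inputs at every training iteration; everything else here is an assumption made for illustration. In particular, the stopping rule (variance of the fixed-input loss over a recent window, relative to its variance over the whole run), the names `train_with_fixed_input_early_stopping`, `window`, and `rel_var_threshold`, and the toy exponentially decaying loss are all hypothetical. In a real diffusion setup, "same inputs" would plausibly mean a fixed batch of images with fixed noise and timesteps, so that the loss is deterministic and comparable across iterations.

```python
import math
import statistics
from typing import Callable, List


def train_with_fixed_input_early_stopping(
    train_step: Callable[[int], None],
    fixed_input_loss: Callable[[], float],
    max_steps: int,
    window: int = 50,                 # hypothetical hyperparameter
    rel_var_threshold: float = 0.15,  # hypothetical hyperparameter
) -> int:
    """Train until the loss on fixed inputs stops changing; return steps run.

    Assumed rule (not taken from the abstract): stop once the variance of
    the fixed-input loss over the last `window` steps is small relative to
    its variance over the whole run so far.
    """
    history: List[float] = []
    for step in range(1, max_steps + 1):
        train_step(step)
        # Same inputs at every iteration, so values are directly comparable.
        history.append(fixed_input_loss())
        if len(history) < 2 * window:
            continue  # not enough history to compare recent vs. overall
        recent_var = statistics.pvariance(history[-window:])
        total_var = statistics.pvariance(history)
        if total_var > 0 and recent_var / total_var < rel_var_threshold:
            return step
    return max_steps


# Toy stand-in for training: the fixed-input loss decays and then plateaus.
state = {"step": 0}


def toy_train_step(step: int) -> None:
    state["step"] = step


def toy_fixed_input_loss() -> float:
    return math.exp(-state["step"] / 20.0) + 0.1


stopped_at = train_with_fixed_input_early_stopping(
    toy_train_step, toy_fixed_input_loss, max_steps=1000
)
print(f"stopped after {stopped_at} of 1000 steps")
```

Evaluating on fixed inputs matters because the usual training loss is computed on randomly sampled noise and timesteps, which makes it too noisy to show when a concept has already been learned.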