Text-to-image diffusion models show great potential for synthesizing a wide variety of concepts in new compositions and scenarios. However, the latent space of initial seeds is still not well understood, and its structure has been shown to affect the generation of various concepts. Specifically, simple operations such as interpolating between seeds or finding the centroid of a set of seeds perform poorly under standard Euclidean or spherical metrics in the latent space. We observe that, under current training procedures, diffusion models only see inputs whose norms fall within a narrow range. This has strong implications for methods that rely on seed manipulation for image generation, with applications to few-shot and long-tail learning tasks. To address this issue, we propose a novel method for interpolating between two seeds and demonstrate that it defines a new non-Euclidean metric that takes into account a norm-based prior on seeds. We describe a simple yet efficient algorithm for approximating this interpolation procedure and use it to further define centroids in the latent seed space. We show that our new interpolation and centroid techniques significantly enhance the generation of images of rare concepts. This in turn leads to state-of-the-art performance on few-shot and long-tail benchmarks, improving on prior approaches in terms of generation speed, image quality, and semantic content.
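The narrow-norm observation can be illustrated with a short numerical sketch. It is not the interpolation or centroid method proposed here; it only shows, for standard Gaussian seeds, that norms concentrate near the square root of the dimension and that Euclidean midpoints and means fall well below that range. The latent size `4 * 64 * 64`, the sample count, and the helper names `lerp` and `slerp` are assumptions chosen for illustration, and spherical interpolation appears only as a familiar baseline.

```python
import numpy as np

def lerp(a, b, t):
    """Euclidean (linear) interpolation between two seeds."""
    return (1.0 - t) * a + t * b

def slerp(a, b, t):
    """Spherical linear interpolation, shown only as a common baseline."""
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return lerp(a, b, t)  # nearly parallel seeds: fall back to lerp
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

rng = np.random.default_rng(0)
d = 4 * 64 * 64  # assumed latent size, for illustration only
seeds = rng.standard_normal((1000, d))

# Norms of Gaussian seeds concentrate tightly around sqrt(d).
norms = np.linalg.norm(seeds, axis=1)

a, b = seeds[0], seeds[1]
mid_lerp = lerp(a, b, 0.5)     # norm shrinks to roughly sqrt(d / 2)
mid_slerp = slerp(a, b, 0.5)   # norm stays close to sqrt(d)
centroid = seeds.mean(axis=0)  # Euclidean mean collapses toward the origin

print(f"sqrt(d)               = {np.sqrt(d):.1f}")
print(f"seed norms: mean, std = {norms.mean():.1f}, {norms.std():.2f}")
print(f"lerp midpoint norm    = {np.linalg.norm(mid_lerp):.1f}")
print(f"slerp midpoint norm   = {np.linalg.norm(mid_slerp):.1f}")
print(f"Euclidean mean norm   = {np.linalg.norm(centroid):.1f}")
```

Under these assumptions, the Euclidean midpoint and mean land at norms that sampled seeds essentially never take, which is consistent with the claim that a model trained on a narrow range of norms handles such points poorly.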