In addition to their unprecedented capacity for imaginative creation, large text-to-image models are expected to incorporate customized concepts into image generation. Existing works generally learn such concepts in an optimization-based manner, which incurs an excessive computation or memory burden. In this paper, we instead propose a learning-based encoder, consisting of a global and a local mapping network, for fast and accurate customized text-to-image generation. Specifically, the global mapping network projects the hierarchical features of a given image into multiple new words in the textual word embedding space, i.e., one primary word for the well-editable concept and other auxiliary words to exclude irrelevant disturbances (e.g., background). Meanwhile, the local mapping network injects the encoded patch features into the cross-attention layers to provide omitted details without sacrificing the editability of the primary concept. We compare our method with existing optimization-based approaches on a variety of user-defined concepts and demonstrate that it enables high-fidelity inversion and more robust editability with a significantly faster encoding process. Our code is publicly available at https://github.com/csyxwei/ELITE.
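To make the two-branch design described above more concrete, here is a minimal sketch in plain Python. It is not the released implementation. Every name, dimension, and design detail below is an assumption made for illustration: random matrices stand in for learned layers, one linear map is used per feature level, the first mapped word is arbitrarily treated as the primary one, and the local branch is merged into cross attention by a weighted sum with a hypothetical weight `lam`.

```python
import math
import random


def linear(in_dim, out_dim, rng):
    """Random (out_dim x in_dim) matrix standing in for a learned linear layer."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]


def matvec(w, x):
    """Multiply matrix w by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query scaled dot-product attention."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]


def global_mapping(level_features, level_weights):
    """Map one feature vector per hierarchy level to one word embedding each.

    Assumption: the first word is treated as the primary (editable) word and
    the remaining ones as auxiliary words.
    """
    words = [matvec(w, f) for w, f in zip(level_weights, level_features)]
    return words[0], words[1:]


def cross_attention_with_local(query, text_keys, text_values,
                               patch_features, w_k, w_v, lam=1.0):
    """Text cross attention plus a local branch over encoded patch features.

    Assumption: the two attention outputs are combined as text + lam * local.
    """
    text_out = attend(query, text_keys, text_values)
    local_keys = [matvec(w_k, p) for p in patch_features]
    local_values = [matvec(w_v, p) for p in patch_features]
    local_out = attend(query, local_keys, local_values)
    return [t + lam * l for t, l in zip(text_out, local_out)]


# Toy usage: 3 feature levels and 5 patches of size 8, embeddings of size 4.
rng = random.Random(0)
FEAT_DIM, EMB_DIM = 8, 4
level_features = [[rng.random() for _ in range(FEAT_DIM)] for _ in range(3)]
level_weights = [linear(FEAT_DIM, EMB_DIM, rng) for _ in range(3)]
primary_word, auxiliary_words = global_mapping(level_features, level_weights)

patch_features = [[rng.random() for _ in range(FEAT_DIM)] for _ in range(5)]
w_k, w_v = linear(FEAT_DIM, EMB_DIM, rng), linear(FEAT_DIM, EMB_DIM, rng)
text_keys = [primary_word] + auxiliary_words
text_values = [primary_word] + auxiliary_words
query = [rng.random() for _ in range(EMB_DIM)]
fused = cross_attention_with_local(query, text_keys, text_values,
                                   patch_features, w_k, w_v, lam=0.5)
```

In this sketch, setting `lam=0.0` reduces the layer to ordinary text cross attention, which is one simple way to see how a local branch can add detail without replacing the text-driven pathway.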