Despite their unprecedented capacity for imaginative creation, large text-to-image models are further expected to express customized concepts. Existing works generally learn such concepts in an optimization-based manner, which incurs excessive computation or memory cost. In this paper, we instead propose a learning-based encoder for fast and accurate concept customization, consisting of a global and a local mapping network. Specifically, the global mapping network separately projects the hierarchical features of a given image into multiple "new" words in the textual word embedding space: one primary word for the well-editable concept and auxiliary words that exclude irrelevant disturbances (e.g., background). Meanwhile, the local mapping network injects the encoded patch features into the cross-attention layers to supply the omitted details without sacrificing the editability of the primary concept. We compare our method with prior optimization-based approaches on a variety of user-defined concepts and demonstrate that it achieves higher-fidelity inversion and more robust editability with a significantly faster encoding process. Our code will be publicly available at https://github.com/csyxwei/ELITE.
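To make the two-branch design described in the abstract more concrete, the following is a minimal NumPy sketch, not the released ELITE implementation. Everything in it is an assumption made for illustration: the function names (`global_mapping`, `local_injection`), the tensor sizes, the use of plain linear projections, the choice of the first projected word as the primary one, and the additive fusion weight `lam`. Keys and values are taken to be the embeddings themselves, whereas a real cross-attention layer would apply learned key and value projections.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: q is (n_q, d), k and v are (n_kv, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def global_mapping(layer_feats, proj_mats):
    # Hypothetical global branch: one image feature vector per encoder
    # layer, each projected separately into the word embedding space.
    return np.stack([f @ w for f, w in zip(layer_feats, proj_mats)])

def local_injection(q, text_emb, patch_emb, lam=0.5):
    # Hypothetical local branch: attend to the text tokens and to the
    # encoded patch features, then add the two results. Additive fusion
    # with weight `lam` is an assumption of this sketch.
    text_out = attention(q, text_emb, text_emb)
    patch_out = attention(q, patch_emb, patch_emb)
    return text_out + lam * patch_out

# Illustrative sizes only (all assumed).
feat_dim, word_dim, n_layers = 16, 8, 3
n_patches, n_queries, n_tokens = 5, 4, 6

layer_feats = [rng.standard_normal(feat_dim) for _ in range(n_layers)]
proj_mats = [rng.standard_normal((feat_dim, word_dim)) for _ in range(n_layers)]

words = global_mapping(layer_feats, proj_mats)       # (n_layers, word_dim)
primary_word, auxiliary_words = words[0], words[1:]  # split is assumed

# Place the primary word at a placeholder position in the prompt embedding.
prompt = rng.standard_normal((n_tokens, word_dim))
prompt[2] = primary_word

patches = rng.standard_normal((n_patches, word_dim))  # encoded patch features
queries = rng.standard_normal((n_queries, word_dim))  # latent image queries

fused = local_injection(queries, prompt, patches, lam=0.5)
print(words.shape, fused.shape)
```

Setting `lam=0.0` reduces `local_injection` to ordinary text-only cross-attention, which is one way to see how a local branch of this kind can add detail without altering the text pathway that carries editability.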