ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation

Beyond their unprecedented capacity for imaginative creation, large text-to-image models are expected to incorporate customized concepts into image generation. Existing works generally learn such concepts through optimization, which incurs excessive computation or memory cost. In this paper, we instead propose a learning-based encoder, consisting of a global and a local mapping network, for fast and accurate customized text-to-image generation. Specifically, the global mapping network projects the hierarchical features of a given image into multiple new words in the textual word embedding space: one primary word for the well-editable concept, and auxiliary words to exclude irrelevant disturbances (e.g., background). Meanwhile, the local mapping network injects the encoded patch features into cross-attention layers to supply omitted details without sacrificing the editability of the primary concept. We compare our method with existing optimization-based approaches on a variety of user-defined concepts and demonstrate that it enables high-fidelity inversion and more robust editability with a significantly faster encoding process. Our code is publicly available at https://github.com/csyxwei/ELITE.
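
The global mapping step described in the abstract can be illustrated with a minimal sketch. It is not the released ELITE implementation: the names (`make_mapper`, `global_mapping`), the toy dimensions, the random weights standing in for learned layers, and the choice of which layer yields the primary word are all assumptions made for illustration. It only shows the shape of the idea: one feature vector per encoder layer is mapped to one word embedding, giving a primary word plus auxiliary words.

```python
import random


def make_mapper(in_dim, out_dim, rng):
    """Hypothetical stand-in for a learned mapping layer: a random weight matrix."""
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_mapper(weights, vec):
    """Matrix-vector product: project a feature vector into the word embedding space."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def global_mapping(layer_features, mappers):
    """Map one feature vector per encoder layer to one word embedding each.

    Assumption: the first entry is treated as the primary word and the rest
    as auxiliary words; the abstract does not say which layer plays which role.
    """
    words = [apply_mapper(w, f) for w, f in zip(mappers, layer_features)]
    return words[0], words[1:]


rng = random.Random(0)
feature_dim, word_dim, num_layers = 8, 4, 3  # toy sizes, not the paper's

# Toy "hierarchical features": one vector per image-encoder layer.
layer_features = [[rng.random() for _ in range(feature_dim)] for _ in range(num_layers)]
mappers = [make_mapper(feature_dim, word_dim, rng) for _ in range(num_layers)]

primary, auxiliary = global_mapping(layer_features, mappers)
print(len(primary), len(auxiliary))
```

In this sketch only `primary` would be placed in the text prompt at generation time, which is one way to read the abstract's claim that auxiliary words keep background and other disturbances out of the editable concept.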

