This paper proposes a method for generating images of customized objects specified by users. The method is based on a general framework that bypasses the lengthy optimization required by previous approaches, which often follow a per-object optimization paradigm. Our framework adopts an encoder to capture high-level identifiable semantics of objects, producing an object-specific embedding with only a single feed-forward pass. The acquired object embedding is then passed to a text-to-image synthesis model for subsequent generation. To effectively blend an object-aware embedding space into a well-developed text-to-image model under the same generation context, we investigate different network designs and training strategies, and propose a simple yet effective regularized joint training scheme with an object identity preservation loss. Additionally, we propose a caption generation scheme that becomes a critical piece in ensuring that the object-specific embedding is faithfully reflected in the generation process, while preserving control and editing abilities. Once trained, the network is able to produce diverse content and styles, conditioned on both texts and objects. We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity, without the need for test-time optimization. Systematic studies are also conducted to analyze our models, providing insights for future work.