Large-scale text-to-image (T2I) models have shown remarkable generative ability, demonstrating that they learn complex structures and meaningful semantics. However, text prompts alone cannot fully exploit the knowledge the model has learned, especially when flexible and accurate structural control is needed. In this paper, we aim to ``dig out'' the capabilities that T2I models have implicitly learned and then use them explicitly to control generation at a finer granularity. Specifically, we propose to learn simple and lightweight T2I-Adapters that align the internal knowledge of T2I models with external control signals, while keeping the original large T2I models frozen. In this way, we can train separate adapters for different conditions and achieve rich control and editing effects. Furthermore, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter achieves promising generation quality and supports a wide range of applications.
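The core idea of the abstract (a frozen base model, a small trainable adapter driven by an external control signal, and the combination of several adapters) can be illustrated with a toy sketch. This is not the paper's architecture: `FrozenBase`, `Adapter`, `guided_features`, and `train_step` are hypothetical names, the "features" are plain lists of floats, and both the additive residual and the weighted-sum composition are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Sequence

Vector = List[float]


@dataclass(frozen=True)
class FrozenBase:
    """Stand-in for the pretrained T2I model; its weights are never updated."""
    weights: tuple = (0.5, -1.0, 2.0)

    def features(self, text_embedding: Sequence[float]) -> Vector:
        # Element-wise "feature extraction" from the text prompt embedding.
        return [w * x for w, x in zip(self.weights, text_embedding)]


@dataclass
class Adapter:
    """Small trainable module mapping a control signal to a feature residual."""
    scale: Vector = field(default_factory=lambda: [0.0, 0.0, 0.0])

    def __call__(self, condition: Sequence[float]) -> Vector:
        return [s * c for s, c in zip(self.scale, condition)]


def guided_features(base, text_embedding, adapters, conditions, mix) -> Vector:
    """Add a weighted sum of adapter residuals to the frozen base features."""
    out = base.features(text_embedding)
    for adapter, cond, w in zip(adapters, conditions, mix):
        out = [o + w * r for o, r in zip(out, adapter(cond))]
    return out


def train_step(base, adapter, text_embedding, condition, target, lr=0.1) -> float:
    """One gradient step on squared error; only adapter.scale is modified.

    Returns the loss measured before the update.
    """
    pred = guided_features(base, text_embedding, [adapter], [condition], [1.0])
    # d/ds_i of sum_i (pred_i - target_i)^2 is 2 * (pred_i - target_i) * c_i.
    adapter.scale = [
        s - lr * 2.0 * (p - t) * c
        for s, p, t, c in zip(adapter.scale, pred, target, condition)
    ]
    return sum((p - t) ** 2 for p, t in zip(pred, target))
```

Under these assumptions, calling `train_step` repeatedly lowers the loss while `base.weights` stays untouched, and passing several adapters with different `mix` weights to `guided_features` mimics composing multiple conditions.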