Diffusion models have demonstrated remarkable performance in text-to-image generation. However, most widely used models still employ CLIP as their text encoder, which constrains their ability to comprehend dense prompts, such as those involving multiple objects, detailed attributes, complex relationships, and long-text alignment. In this paper, we introduce an Efficient Large Language Model Adapter, termed ELLA, which equips text-to-image diffusion models with powerful Large Language Models (LLMs) to enhance text alignment without training either the U-Net or the LLM. To seamlessly bridge the two pre-trained models, we investigate a range of semantic alignment connector designs and propose a novel module, the Timestep-Aware Semantic Connector (TSC), which dynamically extracts timestep-dependent conditions from the LLM. Our approach adapts semantic features at different stages of the denoising process, assisting diffusion models in interpreting lengthy and intricate prompts over sampling timesteps. Additionally, ELLA can be readily incorporated with community models and tools to improve their prompt-following capabilities. To assess text-to-image models in dense prompt following, we introduce the Dense Prompt Graph Benchmark (DPG-Bench), a challenging benchmark consisting of 1K dense prompts. Extensive experiments demonstrate the superiority of ELLA in dense prompt following compared to state-of-the-art methods, particularly in multiple object compositions involving diverse attributes and relationships.
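To make the idea of a timestep-dependent condition concrete, the following is a minimal sketch and not the TSC architecture itself, which the abstract does not specify. It assumes a small set of learnable queries that are modulated by a sinusoidal timestep embedding and then attend over frozen text-encoder features. The class name `ToyTimestepAwareConnector`, the scale-and-shift modulation, the query-based pooling, and the random weights are all hypothetical choices made for illustration.

```python
import math
import random


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep; dim is assumed to be even."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ToyTimestepAwareConnector:
    """Hypothetical connector: queries are modulated by the timestep, then
    pool the (frozen) text features through single-head attention."""

    def __init__(self, dim, num_queries, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Stand-ins for learnable parameters (random here, trained in practice).
        self.queries = [[rng.gauss(0, 1) for _ in range(dim)]
                        for _ in range(num_queries)]
        self.scale_w = [rng.gauss(0, 0.5) for _ in range(dim)]
        self.shift_w = [rng.gauss(0, 0.5) for _ in range(dim)]

    def __call__(self, text_features, t):
        emb = timestep_embedding(t, self.dim)
        conditions = []
        for q in self.queries:
            # Assumed modulation: per-channel scale and shift driven by the
            # timestep embedding, so the same prompt yields different queries
            # at different denoising steps.
            q_t = [qi * (1.0 + sw * e) + hw * e
                   for qi, sw, hw, e in zip(q, self.scale_w, self.shift_w, emb)]
            scores = [dot(q_t, tok) / math.sqrt(self.dim)
                      for tok in text_features]
            attn = softmax(scores)
            # Each condition token is an attention-weighted mix of text tokens.
            conditions.append([sum(a * tok[d] for a, tok in zip(attn, text_features))
                               for d in range(self.dim)])
        return conditions


# Usage: random vectors stand in for the frozen LLM's token features.
rng = random.Random(1)
text_features = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(12)]
connector = ToyTimestepAwareConnector(dim=8, num_queries=4)
cond_early = connector(text_features, t=900)  # noisy, early sampling step
cond_late = connector(text_features, t=50)    # nearly denoised, late step
```

In this sketch only the connector's parameters would be trained, while the text features and the downstream denoiser stay fixed, mirroring the abstract's claim that neither the U-Net nor the LLM is trained. The output has a fixed number of condition tokens regardless of prompt length, which is one plausible way to hand long prompts to a denoiser that expects a bounded conditioning sequence.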