ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Diffusion models have demonstrated remarkable performance in the domain of text-to-image generation. However, most widely used models still employ CLIP as their text encoder, which constrains their ability to comprehend dense prompts, encompassing multiple objects, detailed attributes, complex relationships, long-text alignment, etc. In this paper, we introduce an Efficient Large Language Model Adapter, termed ELLA, which equips text-to-image diffusion models with powerful Large Language Models (LLM) to enhance text alignment without training of either U-Net or LLM. To seamlessly bridge two pre-trained models, we investigate a range of semantic alignment connector designs and propose a novel module, the Timestep-Aware Semantic Connector (TSC), which dynamically extracts timestep-dependent conditions from LLM. Our approach adapts semantic features at different stages of the denoising process, assisting diffusion models in interpreting lengthy and intricate prompts over sampling timesteps. Additionally, ELLA can be readily incorporated with community models and tools to improve their prompt-following capabilities. To assess text-to-image models in dense prompt following, we introduce Dense Prompt Graph Benchmark (DPG-Bench), a challenging benchmark consisting of 1K dense prompts. Extensive experiments demonstrate the superiority of ELLA in dense prompt following compared to state-of-the-art methods, particularly in multiple object compositions involving diverse attributes and relationships.

翻译：摘要：扩散模型在文本到图像生成领域展现了卓越的性能。然而，绝大多数广泛使用的模型仍采用CLIP作为其文本编码器，这限制了它们理解密集提示的能力，包括多对象、细粒度属性、复杂关系、长文本对齐等。本文提出了一种高效大语言模型适配器ELLA，它无需训练U-Net或大语言模型（LLM）即可为文本到图像扩散模型配备强大的大语言模型以增强文本对齐。为了无缝桥接两个预训练模型，我们研究了多种语义对齐连接器设计，并提出了一种新模块——时序感知语义连接器（TSC），该模块能从大语言模型中动态提取依赖于时间步长的条件。我们的方法在去噪过程的不同阶段自适应调整语义特征，帮助扩散模型在采样时间步长中解释冗长复杂的提示。此外，ELLA可便捷地集成到社区模型和工具中，以提升其提示遵循能力。为评估文本到图像模型在密集提示遵循方面的表现，我们引入了一个包含1000条密集提示的具有挑战性的基准——密集提示图基准（DPG-Bench）。大量实验表明，ELLA在密集提示遵循方面优于现有最先进方法，尤其在涉及多重属性与关系的多对象组合任务中表现突出。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日