Large language models (LLMs) based on decoder-only transformers have demonstrated text-understanding capabilities superior to those of CLIP and T5-series models. However, the paradigm for utilizing current advanced LLMs in text-to-image diffusion models remains underexplored. We observe an unusual phenomenon: directly using an LLM as the prompt encoder significantly degrades the prompt-following ability of image generation. We identify two main obstacles behind this issue. One is the misalignment between next-token-prediction training in LLMs and the need for discriminative prompt features in diffusion models; the other is the intrinsic positional bias introduced by the decoder-only architecture. To address these issues, we propose a novel framework that fully harnesses the capabilities of LLMs. Through carefully designed usage guidance, we enhance the text-representation capability of the prompt encoder and eliminate its inherent positional bias, allowing state-of-the-art LLMs to be integrated into text-to-image generation models flexibly. Furthermore, we provide an effective way to fuse multiple LLMs within our framework. Given the strong performance and scaling behavior demonstrated by the transformer architecture, we further design an LLM-Infused Diffusion Transformer (LI-DiT) based on this framework. We conduct extensive experiments to validate LI-DiT across model sizes and data scales. Benefiting from the inherent abilities of LLMs and our designs, the prompt-understanding performance of LI-DiT surpasses that of state-of-the-art open-source models as well as mainstream closed-source commercial models, including Stable Diffusion 3, DALL-E 3, and Midjourney V6. The powerful LI-DiT-10B will be made available through an online platform and API after further optimization and security checks.
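The positional bias attributed to the decoder-only architecture can be made concrete with a toy counting argument (an illustration of the general mechanism, not the paper's analysis): under a causal attention mask, token *j* can only influence hidden states at positions *j* and later, so early prompt tokens feed into many more positions than late ones, and any feature derived from the hidden states inherits a dependence on token order. The function name `token_influence` is a hypothetical helper for this sketch.

```python
def token_influence(num_tokens, causal=True):
    """Count how many hidden-state positions each token can influence.

    Under a causal mask, token j is visible to positions j..num_tokens-1,
    so it influences (num_tokens - j) states; early tokens therefore
    dominate any representation pooled over positions. With bidirectional
    attention every token is visible to every position, so the counts
    are uniform and no order-dependent bias arises from the mask itself.
    """
    if causal:
        return [num_tokens - j for j in range(num_tokens)]
    return [num_tokens] * num_tokens


# For a 6-token prompt, the causal counts decay monotonically with
# position, while the bidirectional counts are uniform.
print(token_influence(6, causal=True))   # early tokens reach more states
print(token_influence(6, causal=False))  # all tokens reach every state
```

This is only a visibility count; in a real LLM the learned attention weights modulate the effect, but the asymmetry of the causal mask is what the abstract refers to as the architecture's intrinsic positional bias.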