Large language models (LLMs) based on decoder-only transformers have demonstrated text-understanding capabilities superior to those of CLIP and T5-series models. However, the paradigm for utilizing current advanced LLMs in text-to-image diffusion models remains underexplored. We observe an unusual phenomenon: directly using an LLM as the prompt encoder significantly degrades the prompt-following ability of image generation. We identify two main obstacles behind this issue. One is the misalignment between next-token-prediction training in LLMs and the need for discriminative prompt features in diffusion models; the other is the intrinsic positional bias introduced by the decoder-only architecture. To address these issues, we propose a novel framework that fully harnesses the capabilities of LLMs. Through carefully designed usage guidance, we enhance the text-representation capability of the prompt encoder and eliminate its inherent positional bias, allowing state-of-the-art LLMs to be integrated into text-to-image generation models flexibly. Furthermore, we provide an effective way to fuse multiple LLMs within our framework. Given the strong performance and scaling behavior demonstrated by the transformer architecture, we further design an LLM-Infused Diffusion Transformer (LI-DiT) based on this framework. We conduct extensive experiments to validate LI-DiT across model sizes and data scales. Benefiting from the inherent abilities of LLMs and our designs, the prompt-understanding performance of LI-DiT surpasses that of state-of-the-art open-source models as well as mainstream closed-source commercial models, including Stable Diffusion 3, DALL-E 3, and Midjourney V6. The powerful LI-DiT-10B will be made available through an online platform and API after further optimization and security checks.
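The positional bias attributed to the decoder-only architecture can be made concrete with a toy counting argument (an illustration of the general mechanism, not the paper's analysis): under a causal attention mask, token *j* can only influence hidden states at positions *j* and later, so early prompt tokens feed into many more positions than late ones, and any feature derived from the hidden states inherits a dependence on token order. The function name `token_influence` is a hypothetical helper for this sketch.

```python
def token_influence(num_tokens, causal=True):
    """Count how many hidden-state positions each token can influence.

    Under a causal mask, token j is visible to positions j..num_tokens-1,
    so it influences (num_tokens - j) states; early tokens therefore
    dominate any representation pooled over positions. With bidirectional
    attention every token is visible to every position, so the counts
    are uniform and no order-dependent bias arises from the mask itself.
    """
    if causal:
        return [num_tokens - j for j in range(num_tokens)]
    return [num_tokens] * num_tokens


# For a 6-token prompt, the causal counts decay monotonically with
# position, while the bidirectional counts are uniform.
print(token_influence(6, causal=True))   # early tokens reach more states
print(token_influence(6, causal=False))  # all tokens reach every state
```

This is only a visibility count; in a real LLM the learned attention weights modulate the effect, but the asymmetry of the causal mask is what the abstract refers to as the architecture's intrinsic positional bias.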