An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

One critical prerequisite for faithful text-to-image generation is an accurate understanding of the text input. Existing methods use the text encoder of the CLIP model to represent input prompts. However, the pre-trained CLIP model can only encode English text, with a maximum length of 77 tokens. Moreover, the capacity of CLIP's text encoder is relatively limited compared to Large Language Models (LLMs), which support multilingual input, accommodate longer context, and achieve superior text representation. In this paper, we investigate LLMs as the text encoder to improve language understanding in text-to-image generation. Unfortunately, training a text-to-image generative model with an LLM from scratch demands significant computational resources and data. To this end, we introduce a three-stage training pipeline that effectively and efficiently integrates an existing text-to-image model with LLMs. Specifically, we propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs. Extensive experiments demonstrate that our model supports not only multilingual input but also longer input context, with superior image generation quality.
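
To make the adapter idea concrete, here is a minimal sketch, not the authors' implementation. The abstract does not describe the adapter's architecture, so the sketch assumes the simplest possible form: a per-token linear projection from the LLM's hidden width to the width the image model expects as conditioning. The names `LinearAdapter` and `fake_llm_hidden_states`, the dimensions 8 and 4, and the random stand-in for LLM outputs are all hypothetical and chosen only for illustration.

```python
import random

# Token limit of the pre-trained CLIP text encoder, as stated in the abstract.
CLIP_MAX_TOKENS = 77


class LinearAdapter:
    """Hypothetical adapter: a per-token linear map (an assumption, not the paper's design)."""

    def __init__(self, llm_dim, cond_dim, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / llm_dim ** 0.5
        # weight has shape (cond_dim, llm_dim); bias has shape (cond_dim,)
        self.weight = [
            [rng.uniform(-scale, scale) for _ in range(llm_dim)]
            for _ in range(cond_dim)
        ]
        self.bias = [0.0] * cond_dim

    def __call__(self, hidden_states):
        # Project each token vector independently, so sequence length is unconstrained.
        return [
            [sum(w * x for w, x in zip(row, token)) + b
             for row, b in zip(self.weight, self.bias)]
            for token in hidden_states
        ]


def fake_llm_hidden_states(num_tokens, llm_dim, seed=1):
    """Random stand-in for the hidden states a real LLM would produce for a prompt."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(llm_dim)] for _ in range(num_tokens)]


adapter = LinearAdapter(llm_dim=8, cond_dim=4)
long_prompt = fake_llm_hidden_states(num_tokens=120, llm_dim=8)
conditioning = adapter(long_prompt)
print(len(conditioning), len(conditioning[0]))  # 120 4
```

Because the projection acts on each token separately, the 120-token stand-in prompt passes through unchanged in length, beyond the 77 tokens a CLIP text encoder would accept. In a real system only small adapter weights like these would need training to connect the two pre-trained models, which is the efficiency argument the abstract makes.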

