Text encoders in diffusion models have evolved rapidly, transitioning from CLIP to T5-XXL. Although this evolution has significantly enhanced the models' ability to understand complex prompts and to generate text, it has also led to a substantial increase in parameter count. Although the T5 series encoders were trained on the C4 natural language corpus, which includes a significant amount of non-visual data, diffusion models with a T5 encoder do not respond to those non-visual prompts, indicating redundant representational capacity. This raises an important question: "Do we really need such a large text encoder?" In pursuit of an answer, we employ vision-based knowledge distillation to train a series of T5 encoder models. To fully inherit the teacher's capabilities, we constructed our dataset according to three criteria: image quality, semantic understanding, and text rendering. Our results reveal a scaling-down pattern: the distilled T5-base model can generate images of quality comparable to those produced by T5-XXL while being 50 times smaller. This reduction in model size significantly lowers the GPU requirements for running state-of-the-art models such as FLUX and SD3, making high-quality text-to-image generation more accessible.
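The core mechanism mentioned above, distilling a small T5 encoder against a much larger teacher, can be illustrated with a minimal sketch. This is not the paper's actual training code; it assumes a simple per-token feature-matching objective with a linear projection bridging the student's hidden size (768 for T5-base) to the teacher's (4096 for T5-XXL), and uses random tensors in place of real encoder outputs.

```python
import torch
import torch.nn as nn

# Assumed dimensions: T5-XXL hidden size 4096, T5-base hidden size 768,
# a 77-token prompt, batch of 2. Real training would obtain the hidden
# states from frozen-teacher and trainable-student T5 encoders.
T_DIM, S_DIM, SEQ, BATCH = 4096, 768, 77, 2

# Projection head mapping student features into the teacher's space.
proj = nn.Linear(S_DIM, T_DIM)

def distill_loss(student_hidden: torch.Tensor, teacher_hidden: torch.Tensor) -> torch.Tensor:
    """MSE between projected student states and teacher states (one hypothetical choice of objective)."""
    return nn.functional.mse_loss(proj(student_hidden), teacher_hidden)

# Stand-in tensors for encoder outputs
# (in practice: encoder(**tokens).last_hidden_state).
student_h = torch.randn(BATCH, SEQ, S_DIM, requires_grad=True)
teacher_h = torch.randn(BATCH, SEQ, T_DIM)

loss = distill_loss(student_h, teacher_h)
loss.backward()  # gradients flow into the student side and the projection head
```

In an actual pipeline the teacher would be frozen, the loss would be computed over a curated prompt set reflecting the three criteria above, and the distilled student would then be swapped into the diffusion model in place of T5-XXL.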