PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering fundamental innovation in the AIGC community while increasing CO2 emissions. This paper introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px with low training cost, as shown in Figures 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-$\alpha$'s training speed markedly surpasses that of existing large-scale T2I models: PIXART-$\alpha$ takes only 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly \$300,000 (\$26,000 vs. \$320,000) and reducing CO2 emissions by 90%. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-$\alpha$ excels in image quality, artistry, and semantic control. We hope PIXART-$\alpha$ will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.
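
The cost comparison quoted in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the four figures stated above (GPU days and dollar costs); the variable and function names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract (A100 GPU days and training cost in USD).
PIXART_GPU_DAYS = 675
SD15_GPU_DAYS = 6_250
PIXART_COST_USD = 26_000
SD15_COST_USD = 320_000


def ratio(part: float, whole: float) -> float:
    """Return part as a fraction of whole."""
    return part / whole


# Training time of PIXART-alpha relative to Stable Diffusion v1.5.
time_ratio = ratio(PIXART_GPU_DAYS, SD15_GPU_DAYS)

# Absolute saving in training cost.
savings_usd = SD15_COST_USD - PIXART_COST_USD

print(f"training time ratio: {time_ratio:.1%}")  # 10.8%
print(f"cost saved: ${savings_usd:,}")           # $294,000
```

The 10.8% figure follows directly from 675 / 6,250, and the \$294,000 difference matches the "nearly \$300,000" saving stated in the abstract.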
