PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion

Current large-scale diffusion models represent a giant leap forward in conditional image synthesis, capable of interpreting diverse cues like text, human poses, and edges. However, their reliance on substantial computational resources and extensive data collection remains a bottleneck. Moreover, existing diffusion models, each specialized for different controls and operating in its own latent space, are hard to integrate: their image resolutions and latent space embedding structures are incompatible, which hinders their joint use. Addressing these constraints, we present "PanGu-Draw", a novel latent diffusion model designed for resource-efficient text-to-image synthesis that adeptly accommodates multiple control signals. We first propose a resource-efficient Time-Decoupling Training Strategy, which splits the monolithic text-to-image model into structure and texture generators. Each generator is trained using a regimen that maximizes data utilization and computational efficiency, cutting data preparation by 48% and reducing training resources by 51%. Second, we introduce "Coop-Diffusion", an algorithm that enables the cooperative use of various pre-trained diffusion models with different latent spaces and predefined resolutions within a unified denoising process. This allows for multi-control image synthesis at arbitrary resolutions without the need for additional data or retraining. Empirical validation of PanGu-Draw shows its exceptional prowess in text-to-image and multi-control image generation, suggesting a promising direction for future model training efficiency and generation versatility. The largest 5B T2I PanGu-Draw model is released on the Ascend platform. Project page: https://pangu-draw.github.io
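
As a rough illustration of the structure/texture split described in the abstract, the sketch below routes each reverse-diffusion step to one of two denoisers depending on the timestep. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how timesteps are divided, so the threshold `t_struct`, the function `time_decoupled_sample`, the rule that high-noise steps go to the structure generator, and the toy denoisers are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical denoiser signature: (noisy sample, timestep) -> less noisy sample.
Denoiser = Callable[[List[float], int], List[float]]


def time_decoupled_sample(
    x: List[float],
    structure_gen: Denoiser,
    texture_gen: Denoiser,
    num_steps: int,
    t_struct: int,
) -> Tuple[List[float], List[str]]:
    """Run the reverse process from t = num_steps down to t = 1.

    Assumption (not stated in the abstract): steps with t > t_struct are
    handled by the structure generator, the remaining steps by the texture
    generator.
    """
    if not 0 <= t_struct <= num_steps:
        raise ValueError("t_struct must lie in [0, num_steps]")
    trace: List[str] = []
    for t in range(num_steps, 0, -1):
        if t > t_struct:
            x = structure_gen(x, t)
            trace.append("structure")
        else:
            x = texture_gen(x, t)
            trace.append("texture")
    return x, trace


# Toy stand-ins for the two generators, for illustration only.
def toy_structure(x: List[float], t: int) -> List[float]:
    return [v / 2 for v in x]


def toy_texture(x: List[float], t: int) -> List[float]:
    return [v + 1 for v in x]


sample, trace = time_decoupled_sample(
    [8.0], toy_structure, toy_texture, num_steps=4, t_struct=2
)
print(sample, trace)
```

Because each generator only ever sees its own range of timesteps, the two can in principle be trained separately on different data and at different cost, which is the kind of saving the abstract attributes to the Time-Decoupling Training Strategy.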

