RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths

Text-to-image generation has recently witnessed remarkable achievements. We introduce a text-conditional image diffusion model, termed RAPHAEL, to generate highly artistic images that accurately portray text prompts encompassing multiple nouns, adjectives, and verbs. This is achieved by stacking tens of mixture-of-experts (MoE) layers, i.e., space-MoE and time-MoE layers, enabling billions of diffusion paths (routes) from the network input to the output. Each path intuitively functions as a "painter" that depicts a particular textual concept onto a specified image region at a given diffusion timestep. Comprehensive experiments reveal that RAPHAEL outperforms recent cutting-edge models, such as Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2, in terms of both image quality and aesthetic appeal. First, RAPHAEL exhibits superior performance in switching images across diverse styles, such as Japanese comics, realism, cyberpunk, and ink illustration. Second, a single model with three billion parameters, trained on 1,000 A100 GPUs for two months, achieves a state-of-the-art zero-shot FID score of 6.61 on the COCO dataset. Furthermore, RAPHAEL significantly surpasses its counterparts in human evaluation on the ViLG-300 benchmark. We believe that RAPHAEL holds the potential to propel the frontiers of image generation research in both academia and industry, paving the way for future breakthroughs in this rapidly evolving field. More details are available at https://miaohua.sensetime.com/en.
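To make the "billions of diffusion paths" claim concrete, the following is a minimal counting sketch, not the authors' implementation. It assumes that a path is formed by selecting exactly one expert in each stacked MoE layer, so the number of paths is the product of the per-layer expert counts. The function name `count_paths` and the configuration (10 layers with 8 experts each) are hypothetical and chosen only for illustration; the abstract does not state the actual layer or expert counts.

```python
from typing import Sequence


def count_paths(experts_per_layer: Sequence[int]) -> int:
    """Count distinct routes when one expert is selected per MoE layer.

    Assumption: each layer independently routes to exactly one of its
    experts, so the total is the product of the per-layer expert counts.
    """
    total = 1
    for num_experts in experts_per_layer:
        if num_experts < 1:
            raise ValueError("each layer needs at least one expert")
        total *= num_experts
    return total


if __name__ == "__main__":
    # Hypothetical configuration, not taken from the abstract:
    # 10 stacked MoE layers, each with 8 experts.
    layers = [8] * 10
    print(f"{len(layers)} layers -> {count_paths(layers):,} possible paths")
```

Under this assumption the path count grows exponentially with depth, which is why a modest number of experts per layer can already yield more than a billion routes once enough layers are stacked.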

