Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion

Anton Razzhigaev, Arseniy Shakhmatov, Anastasia Maltseva, Vladimir Arkhipkin, Igor Pavlov, Ilya Ryabov, Angelina Kuts, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov

Text-to-image generation is a significant domain in modern computer vision and has improved substantially as generative architectures have evolved. Among these architectures, diffusion-based models have demonstrated significant quality gains. These models generally fall into two categories: pixel-level and latent-level approaches. We present Kandinsky, a novel exploration of latent diffusion architecture that combines the principles of image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to CLIP image embeddings. Another distinct feature of the proposed model is a modified MoVQ implementation, which serves as the image autoencoder component. Overall, the model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes, including text-to-image generation, image fusion, text and image fusion, image variation generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations show an FID score of 8.03 on the COCO-30K dataset, making our model the top open-source performer in terms of measurable image generation quality.
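The data flow described in the abstract (text embedding, then a separately trained image prior, then latent diffusion, then an autoencoder decoder) can be illustrated with a toy sketch. Everything in it is a hypothetical stand-in invented for illustration: the function names, the 8-dimensional embeddings, and the arithmetic are assumptions, not the Kandinsky implementation or its API. The abstract also does not state exactly what the diffusion stage is conditioned on; this sketch assumes the predicted image embedding only.

```python
import math
import random
from typing import List

Vector = List[float]
EMBED_DIM = 8  # toy size chosen for illustration; real embeddings are far larger


def encode_text(prompt: str) -> Vector:
    """Hypothetical stand-in for a text encoder: a deterministic pseudo-embedding."""
    rng = random.Random(prompt)  # string seeds give reproducible values
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def image_prior(text_emb: Vector) -> Vector:
    """Hypothetical stand-in for the image prior: text embedding -> image embedding."""
    raw = [math.tanh(2.0 * x) for x in text_emb]  # arbitrary fixed mapping
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]  # L2-normalised, as embeddings often are


def latent_diffusion(image_emb: Vector, steps: int = 10, seed: int = 0) -> Vector:
    """Toy 'denoising' loop: start from noise, move toward the conditioning vector."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in image_emb]
    for _ in range(steps):
        # Each step halves the distance to the conditioning signal.
        latent = [z + 0.5 * (c - z) for z, c in zip(latent, image_emb)]
    return latent


def decode_latent(latent: Vector) -> List[int]:
    """Hypothetical stand-in for the autoencoder decoder: latents -> 0..255 pixels."""
    return [max(0, min(255, round((z + 1.0) * 127.5))) for z in latent]


def generate(prompt: str, steps: int = 10, seed: int = 0) -> List[int]:
    """Chain the stages in the order the abstract lists them."""
    text_emb = encode_text(prompt)
    image_emb = image_prior(text_emb)
    latent = latent_diffusion(image_emb, steps=steps, seed=seed)
    return decode_latent(latent)


if __name__ == "__main__":
    print(generate("a red cat on a sofa"))
```

The point of the sketch is the separation of concerns: the prior can be trained and swapped independently of the diffusion and decoder stages, because the only thing passed between them is an embedding vector.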

