Text-to-image generation is a significant domain in modern computer vision and has improved substantially through the evolution of generative architectures. Among these, diffusion-based models have demonstrated significant quality improvements. These models generally fall into two categories: pixel-level and latent-level approaches. We present Kandinsky, a novel exploration of latent diffusion architecture that combines the principles of image prior models with latent diffusion techniques. The image prior model is trained separately to map CLIP text embeddings to CLIP image embeddings. Another distinct feature of the proposed model is a modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes: text-to-image generation, image fusion, text and image fusion, image variation generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate an FID score of 8.03 on the COCO-30K dataset, making our model the top open-source performer in terms of measurable image generation quality.
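The data flow described in the abstract (text embedding, image prior, latent diffusion, autoencoder decoding) can be illustrated with a toy sketch. This is not the released Kandinsky code: every function name, dimension, and update rule below is a hypothetical stand-in chosen only to show how the stages connect. In particular, conditioning the denoising loop on the predicted image embedding alone is an assumption of this sketch.

```python
import random

# Toy sizes, chosen for illustration only (assumptions, not model values).
CLIP_DIM = 8      # stand-in for the CLIP embedding size
LATENT_DIM = 4    # stand-in for the autoencoder latent size
NUM_STEPS = 10    # stand-in for the number of denoising steps


def encode_text(prompt):
    """Hypothetical stand-in for a CLIP text encoder."""
    rng = random.Random(prompt)  # deterministic pseudo-embedding per prompt
    return [rng.uniform(-1.0, 1.0) for _ in range(CLIP_DIM)]


def image_prior(text_emb):
    """Hypothetical stand-in for the separately trained image prior,
    which maps a text embedding to an image embedding."""
    return [
        0.5 * text_emb[i] + 0.5 * text_emb[(i + 1) % CLIP_DIM]
        for i in range(CLIP_DIM)
    ]


def denoise_step(latent, image_emb, step):
    """Hypothetical stand-in for one conditioned denoising step:
    pull the noisy latent toward a target derived from the condition."""
    target = image_emb[:LATENT_DIM]
    blend = 1.0 / (NUM_STEPS - step)  # reaches 1.0 on the last step
    return [(1.0 - blend) * z + blend * t for z, t in zip(latent, target)]


def decode_latent(latent):
    """Hypothetical stand-in for the MoVQ decoder: latent -> pixel values."""
    return [min(1.0, max(0.0, 0.5 + 0.5 * z)) for z in latent]


def generate(prompt, seed=0):
    """Run the toy pipeline: text -> prior -> latent diffusion -> decode."""
    text_emb = encode_text(prompt)
    image_emb = image_prior(text_emb)
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]  # start from noise
    for step in range(NUM_STEPS):
        latent = denoise_step(latent, image_emb, step)
    return decode_latent(latent)


if __name__ == "__main__":
    print(generate("a red cat on a sofa"))
```

The point of the sketch is the separation of stages: the prior can be trained and swapped independently of the diffusion model, and the diffusion loop operates entirely in the autoencoder's latent space before a single decode step produces pixels.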