We present a one-shot text-to-image diffusion model that generates high-resolution images from natural language descriptions. The model employs a layered U-Net architecture that synthesizes images at multiple resolution scales simultaneously. We show that this approach outperforms the baseline of synthesizing images only at the target resolution while reducing the computational cost per step. We further demonstrate that higher-resolution synthesis can be achieved by layering convolutions at additional resolution scales, in contrast to methods that require separate super-resolution models.
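To make the layered idea concrete, the following is a minimal toy sketch of one step that produces outputs at several resolutions simultaneously, rather than only at the target resolution. All names (`upsample2x`, `refine`, `multi_scale_step`), the nearest-neighbor upsampling, and the smoothing stand-in for a convolutional block are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor upsampling; a stand-in for a learned upsampler.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def refine(x):
    # Stand-in for a convolutional block at this scale:
    # a simple 5-point neighborhood average (wrap-around borders).
    return (x
            + np.roll(x, 1, axis=0) + np.roll(x, -1, axis=0)
            + np.roll(x, 1, axis=1) + np.roll(x, -1, axis=1)) / 5.0

def multi_scale_step(base, n_scales=3):
    # One "step": synthesize at the base resolution, then layer a
    # refinement at each doubled scale, yielding all scales at once.
    outs = []
    x = refine(base)
    outs.append(x)
    for _ in range(n_scales - 1):
        x = refine(upsample2x(x))
        outs.append(x)
    return outs

rng = np.random.default_rng(0)
outs = multi_scale_step(rng.standard_normal((16, 16)))
print([o.shape for o in outs])  # [(16, 16), (32, 32), (64, 64)]
```

In this sketch the per-scale refinements share one forward pass, which is the intuition behind the claimed lower cost per step compared with running a separate super-resolution model after the base synthesis.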