Stable Diffusion, a generative model used in text-to-image synthesis, frequently encounters resolution-induced composition problems when generating images of varying sizes. This issue stems primarily from the model being trained on pairs of single-scale images and their corresponding text descriptions. Moreover, direct training on images of unlimited sizes is infeasible, as it would require an immense number of text-image pairs and entail substantial computational expense. To overcome these challenges, we propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed images of any size while minimizing the need for high-memory GPU resources. The first stage, Any Ratio Adaptability Diffusion (ARAD), optimizes the text-conditional diffusion model on a selected set of images with a restricted range of aspect ratios, thereby improving its ability to adjust composition to diverse image sizes. The second stage introduces Fast Seamless Tiled Diffusion (FSTD), which rapidly enlarges the first-stage output to any high-resolution size while avoiding seam artifacts and memory overload. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks demonstrate that ASD produces well-structured images of arbitrary sizes while halving inference time compared with the traditional tiled algorithm.
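To make the memory argument behind tiled enlargement concrete, the following is a minimal sketch of a generic overlapping-tile layout. It is not the FSTD procedure itself, which the abstract does not specify. The function names `tile_starts` and `tile_boxes`, the 512-pixel tile size, and the 64-pixel overlap are illustrative assumptions. Because every tile has the same fixed size, the peak memory needed to process one tile does not grow with the target resolution, and the overlapping bands give a region in which neighbouring tiles can be blended to hide seams.

```python
def tile_starts(length: int, tile: int, overlap: int) -> list[int]:
    """Start offsets of tiles of size `tile` that cover [0, length).

    Neighbouring tiles share at least `overlap` pixels. All values here are
    illustrative assumptions, not settings taken from the paper.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    if length <= tile:
        return [0]  # a single tile already covers the whole axis
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # final tile sits flush with the far edge
    return starts


def tile_boxes(width: int, height: int, tile: int = 512, overlap: int = 64):
    """Return (left, top, right, bottom) boxes covering a width x height image."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_starts(height, tile, overlap)
        for x in tile_starts(width, tile, overlap)
    ]


if __name__ == "__main__":
    for box in tile_boxes(1024, 768):
        print(box)
```

In a full pipeline, each box would be processed independently by the diffusion model and the overlapping bands blended afterwards; how ASD avoids seams and reduces the per-tile cost is described in the paper rather than in this sketch.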