Stable Diffusion, a generative model for text-to-image synthesis, frequently suffers from resolution-induced composition problems when generating images of varying sizes. The issue stems mainly from the model being trained on pairs of single-scale images and their corresponding text descriptions. Training directly on images of unlimited sizes is infeasible, as it would require an immense number of text-image pairs and incur substantial computational cost. To overcome these challenges, we propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed images of any size while minimizing the need for high-memory GPU resources. In the first stage, Any Ratio Adaptability Diffusion (ARAD) uses a selected set of images with a restricted range of aspect ratios to optimize the text-conditional diffusion model, improving its ability to adjust composition to diverse image sizes. In the second stage, to support the creation of images at any desired size, we introduce Fast Seamless Tiled Diffusion (FSTD), which rapidly enlarges the ASD output to any high-resolution size while avoiding seam artifacts and memory overload. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks demonstrate that ASD produces well-structured images of arbitrary sizes and reduces inference time by a factor of 2 compared with the traditional tiled algorithm.
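To make the idea of tiled processing concrete, the following is a minimal sketch of how a large canvas could be split into non-overlapping tiles on a grid that can be shifted by an offset. It is an illustration only, not the FSTD algorithm: the abstract does not specify how seams are avoided, so the shifted grid is an assumption made here, and the names `tile_boxes`, `tile`, `offset_x`, and `offset_y` are hypothetical.

```python
import random


def tile_boxes(width, height, tile, offset_x=0, offset_y=0):
    """Return (left, top, right, bottom) boxes that exactly cover the canvas.

    Hypothetical helper: the tile grid is shifted by (offset_x, offset_y),
    and tiles crossing the canvas border are clipped to it.
    """
    if tile <= 0:
        raise ValueError("tile must be positive")
    # Start one tile before the canvas edge so a shifted grid still covers it.
    start_x = (offset_x % tile) - tile
    start_y = (offset_y % tile) - tile
    boxes = []
    for top in range(start_y, height, tile):
        for left in range(start_x, width, tile):
            box = (max(left, 0), max(top, 0),
                   min(left + tile, width), min(top + tile, height))
            # Skip tiles that lie entirely outside the canvas after clipping.
            if box[2] > box[0] and box[3] > box[1]:
                boxes.append(box)
    return boxes


if __name__ == "__main__":
    rng = random.Random(0)
    tile = 512
    # Assumption: drawing a fresh offset per step moves the tile borders,
    # so no fixed seam position persists across steps.
    for step in range(3):
        ox, oy = rng.randrange(tile), rng.randrange(tile)
        boxes = tile_boxes(2048, 1536, tile, ox, oy)
        print(step, (ox, oy), len(boxes))
```

Because the tiles do not overlap, each pixel is processed once per pass, and peak memory depends on the tile size rather than on the full canvas size. Whether this matches the mechanism behind the reported speedup is not stated in the abstract.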