Diffusion models (DMs) have recently gained attention for their state-of-the-art performance in text-to-image synthesis. Following common practice in deep learning, DMs are trained and evaluated on images of fixed size. However, users need images of specific sizes and varied aspect ratios. This paper focuses on adapting text-to-image diffusion models to handle such variety while maintaining visual fidelity. First, we observe that, during synthesis, lower-resolution images suffer from incomplete object portrayal, while higher-resolution images exhibit repetitive, disordered presentation. Next, we establish a statistical relationship indicating that attention entropy changes with the number of tokens, suggesting that models aggregate spatial information in proportion to image resolution. Our interpretation of these observations is that objects are incompletely depicted at low resolutions because spatial information is limited, while the repetitive, disorganized presentation at high resolutions arises from redundant spatial information. From this perspective, we propose a scaling factor that alleviates the change in attention entropy and mitigates the observed defective patterns. Extensive experimental results validate the efficacy of the proposed scaling factor, enabling models to achieve better visual effects, image quality, and text alignment. Notably, these improvements are achieved without additional training or fine-tuning.
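The relationship between attention entropy and token count can be illustrated with a toy experiment. The sketch below is not the authors' implementation: it uses random Gaussian queries and keys rather than a diffusion model, and the function names (`attention_entropy`, `rescaled_scale`) are hypothetical. In particular, the form of the scaling factor, sqrt(log N / log T) applied to the usual 1/sqrt(d), where T is the token count at training resolution and N the token count at inference, is an assumption made for illustration; the abstract itself does not state the formula.

```python
import numpy as np


def attention_entropy(num_tokens, dim=64, scale=None, num_queries=64, seed=0):
    """Mean Shannon entropy (in nats) of softmax attention rows.

    Queries and keys are random Gaussian vectors, a toy stand-in for the
    features inside a real attention layer.
    """
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((num_queries, dim))
    k = rng.standard_normal((num_tokens, dim))
    if scale is None:
        scale = 1.0 / np.sqrt(dim)  # standard scaled dot-product attention
    logits = scale * (q @ k.T)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return float(entropy.mean())


def rescaled_scale(num_tokens, train_tokens, dim=64):
    """Hypothetical entropy-aware scale (an assumption, not from the abstract).

    Sharpens attention when there are more tokens than at training time
    and softens it when there are fewer.
    """
    return np.sqrt(np.log(num_tokens) / np.log(train_tokens)) / np.sqrt(dim)


if __name__ == "__main__":
    train_tokens = 1024  # assumed token count at the training resolution
    reference = attention_entropy(train_tokens)
    for n in (256, 1024, 4096):
        standard = attention_entropy(n)
        adjusted = attention_entropy(n, scale=rescaled_scale(n, train_tokens))
        print(n, round(standard, 3), round(adjusted, 3), round(reference, 3))
```

With the standard scale, the mean entropy rises as the number of tokens grows. The adjusted scale moves it back toward the value at the assumed training token count. The toy shows only the direction of the effect, not its size in a real model.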