Scaling up model and data size has been highly successful in the evolution of LLMs. However, the scaling laws of diffusion-based text-to-image (T2I) models have not been fully explored. It is also unclear how to scale such models efficiently for better performance at reduced cost. Differing training settings and expensive training make a fair model comparison extremely difficult. In this work, we empirically study the scaling properties of diffusion-based T2I models by performing extensive and rigorous ablations on scaling both the denoising backbone and the training set, including training scaled UNet and Transformer variants ranging from 0.4B to 4B parameters on datasets of up to 600M images. For model scaling, we find that the location and amount of cross-attention distinguish the performance of existing UNet designs, and that increasing the number of transformer blocks is more parameter-efficient for improving text-image alignment than increasing the channel count. We then identify an efficient UNet variant that is 45% smaller and 28% faster than SDXL's UNet. On the data scaling side, we show that the quality and diversity of the training set matter more than dataset size alone. Increasing caption density and diversity improves both text-image alignment and learning efficiency. Finally, we provide scaling functions that predict text-image alignment performance as a function of model size, compute, and dataset size.
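The abstract does not state the functional form of the scaling functions. As a rough illustration only, the sketch below assumes a simple power law, y ≈ a · x^b, fitted by least squares in log-log space. The names `fit_power_law` and `predict` are hypothetical, and the data points are synthetic values generated from a known curve, not results from this work.

```python
import math


def fit_power_law(xs, ys):
    """Fit y ≈ a * x**b by ordinary least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent b.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx
    # The intercept in log space gives log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict(a, b, x):
    """Evaluate the fitted power law at a new scale x."""
    return a * x ** b


# Synthetic points drawn from y = 2 * x**0.1 (an assumption for illustration,
# not measurements from the paper). x stands in for compute, model size,
# or dataset size; y stands in for an alignment score.
compute = [1e18, 1e19, 1e20, 1e21]
score = [2.0 * c ** 0.1 for c in compute]

a, b = fit_power_law(compute, score)
extrapolated = predict(a, b, 1e22)
```

Fitting in log space keeps the problem linear, so no iterative optimizer is needed. A saturating form (for example, one with an additive offset) would require a nonlinear fit instead.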