Fine-Tuning Stable Diffusion XL for Stylistic Icon Generation: A Comparison of Caption Size

In this paper, we compare fine-tuning methods for Stable Diffusion XL, covering the choice of inference steps and per-image caption customization, with the goal of generating images in the style of a commercial 2D icon training set. We also show how important it is to define precisely what "high-quality" means, especially in a commercial-use environment. As generative AI models continue to gain widespread acceptance and usage, many different ways of optimizing and evaluating them for various applications have emerged. In particular, text-to-image models such as Stable Diffusion XL and DALL-E 3 require distinct evaluation practices if they are to generate high-quality icons in a specific style. Although some images generated in a given style may achieve a lower (better) FID score, we show that this score is not conclusive on its own, even for rasterized icons. While FID scores reflect the similarity of generated images to the overall training set, CLIP scores measure the alignment between generated images and their textual descriptions. We show that FID scores miss significant aspects, such as the small minority of pixel differences that matter most in an icon, while CLIP scores misjudge icon quality. The CLIP model's notion of "similarity" is shaped by its own training data, which does not account for the feature variation in our chosen style. Our findings highlight the need for specialized evaluation metrics and fine-tuning approaches when generating high-quality commercial icons, potentially leading to more effective and tailored applications of text-to-image models in professional design contexts.
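To make the two metrics discussed above concrete, the following is a minimal sketch of how each is commonly computed once features have been extracted. It is not the evaluation code used in this work. It assumes the feature vectors (Inception activations for FID, CLIP image and text embeddings for the CLIP score) are already available as NumPy arrays. The function names `frechet_distance` and `cosine_alignment` are hypothetical, and the rescaling that published CLIP-score variants apply to the cosine similarity is omitted.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (n_samples, n_features). For FID these
    would be Inception activations; here they are assumed to be precomputed.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr(sqrt(cov_a @ cov_b)) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative for
    # positive semi-definite inputs (clipping absorbs numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


def cosine_alignment(image_emb, text_emb):
    """Cosine similarity between one image embedding and one text embedding.

    CLIP-based scores are built on this quantity; any scaling or clamping
    applied by a specific implementation is left out of this sketch.
    """
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    denom = np.linalg.norm(image_emb) * np.linalg.norm(text_emb)
    return float(image_emb @ text_emb / denom)


# Synthetic stand-ins for real features (illustrative only).
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))
shifted = reference + 0.5  # same covariance, mean moved by 0.5 per dimension
print(frechet_distance(reference, reference))  # close to 0
print(frechet_distance(reference, shifted))    # close to 8 * 0.5**2 = 2.0
```

Because the distance depends only on the mean and covariance of the whole feature set, a change confined to a handful of pixels in each icon can leave it almost unchanged, which is consistent with the limitation described above.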
