Progress in synthetic image generation has made it crucial to assess the quality of the generated images. Several metrics have been proposed to assess the rendering of images, but Text-to-Image (T2I) models, which generate images from a prompt, call for additional criteria, such as the extent to which the generated image matches the important content of the prompt. Moreover, although generated images usually result from a random starting point, the influence of that starting point is generally not considered. In this article, we propose a new metric based on prompt templates to study the alignment between the content specified in the prompt and the corresponding generated images. It allows us to better characterize the alignment in terms of the type of the specified objects, their number, and their color. We conducted a study of several recent T2I models covering various aspects. An additional result obtained with our approach is that image quality can vary drastically depending on the latent noise used as the seed for image generation. We also quantify the influence of the number of concepts in the prompt, their order, and their (color) attributes. Finally, our method allows us to identify latent seeds that produce better images than others, opening novel research directions on this understudied topic.