Recent AI-based text-to-image models not only excel at generating realistic images but also give designers increasingly fine-grained control over image content. Consequently, these approaches have garnered increased attention within the computer graphics research community, which has historically been devoted to traditional rendering techniques that offer precise control over scene parameters such as objects, materials, and lighting when generating realistic images. While the quality of rendered images is traditionally assessed through well-established image quality metrics such as SSIM or PSNR, text-to-image models pose unique challenges: in contrast to rendering, they interweave the control of scene and rendering parameters, necessitating the development of novel image quality metrics. Within this survey, we therefore provide a comprehensive overview of existing text-to-image quality metrics, addressing their nuances and the need for alignment with human preferences. Based on our findings, we propose a new taxonomy for categorizing these metrics, grounded in the assumption that there are two main quality criteria, namely compositionality and generality, which ideally map to human preferences. Finally, we derive guidelines for practitioners conducting text-to-image evaluation, discuss open challenges of evaluation mechanisms, and surface limitations of current metrics.