Recent advances in text-to-image synthesis have been enabled by combining language and vision through foundation models. These models are pre-trained on vast numbers of text-image pairs sourced from the World Wide Web or other large-scale databases. As the demand for high-quality image generation shifts towards ensuring content alignment between text and image, novel evaluation metrics have been developed with the aim of mimicking human judgments. Accordingly, researchers have begun collecting datasets with increasingly complex annotations to study the compositionality of vision-language models and to incorporate them as quality measures of compositional alignment between text and image content. In this work, we provide a comprehensive overview of existing text-to-image evaluation metrics and propose a new taxonomy for categorizing them. We also review frequently adopted text-image benchmark datasets before discussing techniques for optimizing text-to-image synthesis models towards quality and human preferences. Finally, we derive guidelines for improving text-to-image evaluation and discuss open challenges and current limitations.