In recent years, Text-to-Image (T2I) models have been extensively studied, especially with the emergence of diffusion models that achieve state-of-the-art results on T2I synthesis tasks. However, existing benchmarks rely heavily on subjective human evaluation, which limits their ability to holistically assess a model's capabilities. Furthermore, there is a significant gap between the effort spent on developing new T2I architectures and the effort spent on evaluating them. To address this, we introduce HRS-Bench, a concrete evaluation benchmark for T2I models that is Holistic, Reliable, and Scalable. Unlike existing benchmarks that focus on limited aspects, HRS-Bench measures 13 skills that fall into five major categories: accuracy, robustness, generalization, fairness, and bias. In addition, HRS-Bench covers 50 scenarios, including fashion, animals, transportation, food, and clothes. We evaluate nine recent large-scale T2I models using metrics that cover a wide range of skills. To probe the effectiveness of HRS-Bench, we conducted a human evaluation, which aligned with 95% of our evaluations on average. Our experiments demonstrate that existing models often struggle to generate images with the desired count of objects, visual text, or grounded emotions. We hope that our benchmark helps ease future text-to-image generation research. The code and data are available at https://eslambakr.github.io/hrsbench.github.io