Rapid progress in text-to-image generative models, coupled with their deployment for visual content creation, has magnified the importance of thoroughly evaluating their performance and identifying potential biases. In pursuit of models that generate images that are realistic, diverse, visually appealing, and consistent with the given prompt, researchers and practitioners often turn to automated metrics to facilitate scalable and cost-effective performance profiling. However, commonly used metrics often fail to account for the full diversity of human preference, and even in-depth human evaluations face challenges with subjectivity, especially as interpretations of evaluation criteria vary across regions and cultures. In this work, we conduct a large, cross-cultural study of how much annotators in Africa, Europe, and Southeast Asia vary in their perception of geographic representation, visual appeal, and consistency in real and generated images from state-of-the-art public APIs. We collect over 65,000 image annotations and 20 survey responses. We contrast human annotations with common automated metrics, finding that human preferences vary notably across geographic location and that current metrics do not fully account for this diversity. For example, annotators in different locations often disagree on whether exaggerated, stereotypical depictions of a region are geographically representative. In addition, the utility of automatic evaluations depends on assumptions about their set-up, such as the alignment of feature extractors with human perception of object similarity or the definition of "appeal" captured in reference datasets used to ground evaluations. We recommend steps for improved automatic and human evaluations.