Efficiently evaluating the performance of text-to-image models is difficult because it inherently relies on subjective judgment and human preference, making it hard to compare different models and quantify the state of the art. Leveraging Rapidata's technology, we present an efficient annotation framework that sources human feedback from a diverse, global pool of annotators. Our study collected over 2 million annotations across 4,512 images, evaluating four prominent models (DALL-E 3, Flux.1, MidJourney, and Stable Diffusion) on style preference, coherence, and text-to-image alignment. We demonstrate that our approach makes it feasible to comprehensively rank image generation models based on a vast pool of annotators, and we show that the diverse annotator demographics reflect the world population, significantly reducing the risk of bias.
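The abstract describes ranking image generation models from millions of human preference annotations. As a minimal sketch of how such pairwise judgments can be aggregated into a ranking (an Elo-style update, which is an assumption here, not necessarily the paper's actual method), consider the following; the model names and votes are hypothetical toy data:

```python
from collections import defaultdict

def elo_rank(comparisons, k=32, base_rating=1000.0):
    """Aggregate pairwise human preferences into per-model Elo ratings.

    comparisons: iterable of (winner, loser) model-name pairs, one per
    annotator judgment. Returns a dict mapping model -> rating.
    """
    ratings = defaultdict(lambda: base_rating)
    for winner, loser in comparisons:
        # Expected score of the winner under the current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        # The winner gains (and the loser loses) more when the win is surprising.
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)

# Hypothetical annotator judgments (not the paper's data):
votes = [
    ("Flux.1", "Stable Diffusion"),
    ("Flux.1", "MidJourney"),
    ("DALL-E 3", "Stable Diffusion"),
]
ranking = sorted(elo_rank(votes).items(), key=lambda kv: -kv[1])
```

With annotations on the scale reported in the abstract, order-dependent updates like Elo are often replaced by a maximum-likelihood fit such as Bradley-Terry, but the aggregation idea is the same: many noisy pairwise preferences yield one global ranking per evaluation criterion.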