Generative AI has made remarkable strides in revolutionizing fields such as image and video generation, driven by innovative algorithms, architectures, and data. However, the rapid proliferation of generative models has exposed a critical gap: the absence of trustworthy evaluation metrics. Current automatic assessments such as FID, CLIP, and FVD often fail to capture the nuanced quality and user satisfaction associated with generative outputs. This paper proposes GenAI-Arena, an open platform for evaluating image and video generative models, where users actively participate in the evaluation. By leveraging collective user feedback and votes, GenAI-Arena aims to provide a more democratic and accurate measure of model performance. It comprises three arenas, for text-to-image generation, text-to-video generation, and image editing respectively, covering a total of 27 open-source generative models. GenAI-Arena has been operating for four months, amassing over 6,000 votes from the community. We describe the platform, analyze the collected data, and explain the statistical methods used to rank the models. To further promote research on model-based evaluation metrics, we release a cleaned version of our preference data for the three tasks, named GenAI-Bench. We prompt existing multimodal models such as Gemini and GPT-4o to mimic human voting and compute the correlation between model votes and human votes to assess their judging abilities. Our results show that existing multimodal models still lag in assessing generated visual content: even the best model, GPT-4o, achieves only a Pearson correlation of 0.22 on the quality subscore and behaves like random guessing on the others.
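Leaderboards built from pairwise user votes are typically ranked with Elo-style rating updates. The sketch below is a minimal, illustrative version of this idea (the function names and the K-factor of 32 are assumptions for illustration, not the platform's actual implementation):

```python
def expected_score(r_a: float, r_b: float) -> float:
    # Probability that model A beats model B under the Elo model
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    # Process one pairwise vote: the winner's rating rises, the loser's falls
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

# Two models start at the same rating; one community vote shifts them apart
ratings = {"model_a": 1000.0, "model_b": 1000.0}
elo_update(ratings, winner="model_a", loser="model_b")
print(ratings)  # → {'model_a': 1016.0, 'model_b': 984.0}
```

Because each update is symmetric, the total rating mass is conserved, and processing votes sequentially yields a ranking that reflects the community's aggregate preferences.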