Generative AI has made remarkable strides in revolutionizing fields such as image and video generation, driven by innovative algorithms, architectures, and data. However, the rapid proliferation of generative models has exposed a critical gap: the absence of trustworthy evaluation metrics. Current automatic assessments such as FID, CLIP, and FVD often fail to capture the nuanced quality and user satisfaction associated with generative outputs. This paper proposes GenAI-Arena, an open platform for evaluating image and video generative models, where users actively participate in the evaluation. By leveraging collective user feedback and votes, GenAI-Arena aims to provide a more democratic and accurate measure of model performance. It comprises three arenas, for text-to-image generation, text-to-video generation, and image editing respectively, covering a total of 27 open-source generative models. GenAI-Arena has been operating for four months, amassing over 6,000 votes from the community. We describe the platform, analyze the collected data, and explain the statistical methods used to rank the models. To further promote research on model-based evaluation metrics, we release a cleaned version of our preference data for the three tasks, named GenAI-Bench. We prompt existing multimodal models such as Gemini and GPT-4o to mimic human voting and compute the correlation between model votes and human votes to assess their judging abilities. Our results show that existing multimodal models still lag in assessing generated visual content: even the best model, GPT-4o, achieves only a Pearson correlation of 0.22 on the quality subscore and behaves like random guessing on the others.
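Leaderboards built from pairwise user votes are typically ranked with Elo-style rating updates. The sketch below is a minimal, illustrative version of this idea (the function names and the K-factor of 32 are assumptions for illustration, not the platform's actual implementation):

```python
def expected_score(r_a: float, r_b: float) -> float:
    # Probability that model A beats model B under the Elo model
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    # Process one pairwise vote: the winner's rating rises, the loser's falls
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

# Two models start at the same rating; one community vote shifts them apart
ratings = {"model_a": 1000.0, "model_b": 1000.0}
elo_update(ratings, winner="model_a", loser="model_b")
print(ratings)  # → {'model_a': 1016.0, 'model_b': 984.0}
```

Because each update is symmetric, the total rating mass is conserved, and processing votes sequentially yields a ranking that reflects the community's aggregate preferences.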