Generative AI has made remarkable strides, revolutionizing fields such as image and video generation. These advancements are driven by innovative algorithms, architectures, and datasets. However, the rapid proliferation of generative models has highlighted a critical gap: the absence of trustworthy evaluation metrics. Automatic metrics such as FID, CLIP score, and FVD often fail to capture the nuanced quality of generative outputs and the satisfaction of their users. This paper proposes GenAI-Arena, an open platform for evaluating image and video generative models, where users actively participate in judging these models. By leveraging collective user feedback and votes, GenAI-Arena aims to provide a more democratic and accurate measure of model performance. It covers three tasks: text-to-image generation, text-to-video generation, and image editing. Currently, we cover a total of 35 open-source generative models. GenAI-Arena has been operating for seven months, amassing over 9000 votes from the community. We describe our platform, analyze the collected data, and explain the statistical methods used to rank the models. To further promote research on model-based evaluation metrics, we release a cleaned version of our preference data for the three tasks, namely GenAI-Bench. We prompt existing multimodal models such as Gemini and GPT-4o to mimic human voting, and we compute their accuracy against the human votes to gauge their judging abilities. Our results show that existing multimodal models still lag in assessing generated visual content; even the best model, GPT-4o, achieves an average accuracy of only 49.19% across the three generative tasks. Open-source MLLMs perform even worse, owing to their limited instruction-following and reasoning abilities in complex visual scenarios.
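The abstract refers to statistical methods for ranking models from pairwise user votes. A common choice for arena-style leaderboards is the Elo rating system; the sketch below is a minimal, illustrative implementation under that assumption. The K-factor, starting rating, and model names in the toy usage are hypothetical and not taken from the paper.

```python
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, model_a: str, model_b: str,
               outcome: float, k: float = 32.0) -> None:
    """Update both ratings in place after one vote.

    outcome: 1.0 if model_a wins, 0.0 if model_b wins, 0.5 for a tie.
    """
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (outcome - e_a)
    ratings[model_b] += k * ((1.0 - outcome) - (1.0 - e_a))

# Toy usage (hypothetical vote log): all models start at 1000.
ratings = defaultdict(lambda: 1000.0)
votes = [("model_x", "model_y", 1.0), ("model_x", "model_y", 0.5)]
for a, b, outcome in votes:
    update_elo(ratings, a, b, outcome)
print(dict(ratings))
```

A practical deployment would likely add confidence intervals (e.g., via bootstrapping over vote orderings), since Elo updates are order-dependent.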
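The judging-accuracy evaluation described above reduces to a simple agreement computation: for each pairwise comparison, the MLLM's vote is compared against the human vote, and accuracy is the fraction of matches. Below is a minimal sketch, assuming each vote is recorded as "left", "right", or "tie"; the function name and vote encoding are illustrative, not from the paper.

```python
def judge_accuracy(human_votes: list[str], model_votes: list[str]) -> float:
    """Fraction of comparisons where the model's vote matches the human vote.

    Each vote is one of "left", "right", or "tie".
    """
    assert len(human_votes) == len(model_votes)
    matches = sum(h == m for h, m in zip(human_votes, model_votes))
    return matches / len(human_votes)

# Toy example: the model agrees with humans on 2 of 4 votes -> 0.50 accuracy.
human = ["left", "tie", "right", "left"]
model = ["left", "right", "right", "tie"]
print(f"accuracy = {judge_accuracy(human, model):.2f}")
```

Averaging this per-task accuracy over text-to-image generation, text-to-video generation, and image editing would yield the kind of cross-task figure reported in the abstract (49.19% for GPT-4o).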