Generative AI has made remarkable strides, revolutionizing fields such as image and video generation. These advancements are driven by innovative algorithms, architectures, and datasets. However, the rapid proliferation of generative models has highlighted a critical gap: the absence of trustworthy evaluation metrics. Automatic metrics such as FID, CLIP score, and FVD often fail to capture the nuanced quality of generative outputs and the satisfaction of their users. This paper proposes GenAI-Arena, an open platform for evaluating image and video generative models, where users actively participate in judging these models. By leveraging collective user feedback and votes, GenAI-Arena aims to provide a more democratic and accurate measure of model performance. It covers three tasks: text-to-image generation, text-to-video generation, and image editing. Currently, we cover a total of 35 open-source generative models. GenAI-Arena has been operating for seven months, amassing over 9000 votes from the community. We describe our platform, analyze the collected data, and explain the statistical methods used to rank the models. To further promote research on model-based evaluation metrics, we release a cleaned version of our preference data for the three tasks, namely GenAI-Bench. We prompt existing multimodal models such as Gemini and GPT-4o to mimic human voting, and we compute their accuracy against the human votes to gauge their judging abilities. Our results show that existing multimodal models still lag in assessing generated visual content; even the best model, GPT-4o, achieves an average accuracy of only 49.19% across the three generative tasks. Open-source MLLMs perform even worse, owing to their limited instruction-following and reasoning abilities in complex visual scenarios.
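The abstract refers to statistical methods for ranking models from pairwise user votes. A common choice for arena-style leaderboards is the Elo rating system; the sketch below is a minimal, illustrative implementation under that assumption. The K-factor, starting rating, and model names in the toy usage are hypothetical and not taken from the paper.

```python
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, model_a: str, model_b: str,
               outcome: float, k: float = 32.0) -> None:
    """Update both ratings in place after one vote.

    outcome: 1.0 if model_a wins, 0.0 if model_b wins, 0.5 for a tie.
    """
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (outcome - e_a)
    ratings[model_b] += k * ((1.0 - outcome) - (1.0 - e_a))

# Toy usage (hypothetical vote log): all models start at 1000.
ratings = defaultdict(lambda: 1000.0)
votes = [("model_x", "model_y", 1.0), ("model_x", "model_y", 0.5)]
for a, b, outcome in votes:
    update_elo(ratings, a, b, outcome)
print(dict(ratings))
```

A practical deployment would likely add confidence intervals (e.g., via bootstrapping over vote orderings), since Elo updates are order-dependent.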
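The judging-accuracy evaluation described above reduces to a simple agreement computation: for each pairwise comparison, the MLLM's vote is compared against the human vote, and accuracy is the fraction of matches. Below is a minimal sketch, assuming each vote is recorded as "left", "right", or "tie"; the function name and vote encoding are illustrative, not from the paper.

```python
def judge_accuracy(human_votes: list[str], model_votes: list[str]) -> float:
    """Fraction of comparisons where the model's vote matches the human vote.

    Each vote is one of "left", "right", or "tie".
    """
    assert len(human_votes) == len(model_votes)
    matches = sum(h == m for h, m in zip(human_votes, model_votes))
    return matches / len(human_votes)

# Toy example: the model agrees with humans on 2 of 4 votes -> 0.50 accuracy.
human = ["left", "tie", "right", "left"]
model = ["left", "right", "right", "tie"]
print(f"accuracy = {judge_accuracy(human, model):.2f}")
```

Averaging this per-task accuracy over text-to-image generation, text-to-video generation, and image editing would yield the kind of cross-task figure reported in the abstract (49.19% for GPT-4o).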