The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models (VLMs) as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is fundamentally limited by stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable, human-aligned evaluation. Crucially, our experiments uncover a striking finding: simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, far surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
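To make the pairwise protocol concrete, the following is a minimal sketch of its aggregation step, assuming per-matchup verdicts from a VLM judge have already been collected; all names here (`verdicts`, `win_rates`, `reference`) are illustrative and not part of GenArena's actual interface.

```python
# Minimal sketch: aggregate pairwise VLM-judge verdicts into model rankings,
# then measure rank agreement with a reference leaderboard via Spearman's rho.
# Data and names are hypothetical, purely for illustration.
from scipy.stats import spearmanr

# Each record: (model_a, model_b, outcome), with outcome in {"A", "B", "tie"}.
verdicts = [
    ("model_x", "model_y", "A"),
    ("model_x", "model_z", "A"),
    ("model_y", "model_z", "B"),
]

def win_rates(records):
    """Aggregate pairwise verdicts into a per-model win rate."""
    wins, games = {}, {}
    for a, b, outcome in records:
        for m in (a, b):
            games[m] = games.get(m, 0) + 1
            wins.setdefault(m, 0.0)
        if outcome == "A":
            wins[a] += 1.0
        elif outcome == "B":
            wins[b] += 1.0
        else:  # a tie counts as half a win for each side
            wins[a] += 0.5
            wins[b] += 0.5
    return {m: wins[m] / games[m] for m in games}

scores = win_rates(verdicts)

# Rank agreement with a reference leaderboard (toy scores for illustration).
reference = {"model_x": 0.8, "model_y": 0.3, "model_z": 0.6}
models = sorted(scores)
rho, _ = spearmanr([scores[m] for m in models],
                   [reference[m] for m in models])
print(f"Spearman rho vs. reference ranking: {rho:.2f}")
```

Under this scheme each model's ranking emerges from head-to-head comparisons rather than absolute scores, which is the property the abstract credits for the gain in stability and human alignment.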