Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline, which combines sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed: even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing. Importantly, performance on VL-RewardBench correlates strongly (Pearson's r > 0.9) with MMMU-Pro accuracy under Best-of-N sampling with VL-GenRMs. Our analysis uncovers three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception rather than at reasoning; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench, together with these experimental insights, will become a valuable resource for advancing VL-GenRMs.
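To make the Best-of-N evaluation protocol concrete, the sketch below shows how a VL-GenRM can act as a selector over sampled candidate answers. This is a minimal illustration under stated assumptions, not the paper's implementation: the `generate` and `score` callables are hypothetical placeholders for a policy model's sampler and the reward model's judgment call.

```python
# A minimal sketch of Best-of-N sampling with a VL-GenRM as selector.
# Illustrative only: `generate` and `score` are hypothetical callables,
# not interfaces defined in the paper.

from typing import Callable, List


def best_of_n(
    image: bytes,
    query: str,
    generate: Callable[[bytes, str, int], List[str]],  # policy model sampler
    score: Callable[[bytes, str, str], float],         # VL-GenRM judgment score
    n: int = 8,
) -> str:
    """Sample n candidate answers and return the one the VL-GenRM rates highest."""
    candidates = generate(image, query, n)
    scores = [score(image, query, c) for c in candidates]
    # Pick the candidate with the highest reward-model score.
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]
```

Under this protocol, a stronger VL-GenRM should select better candidates more often, which is why Best-of-N downstream accuracy (e.g., on MMMU-Pro) can serve as an external check on judgment quality.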