Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline, which combines sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed: even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing. Importantly, performance on VL-RewardBench correlates strongly (Pearson's r > 0.9) with MMMU-Pro accuracy under Best-of-N sampling with VL-GenRMs. Our analysis uncovers three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception rather than at reasoning; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench, together with these experimental insights, will become a valuable resource for advancing VL-GenRMs.
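To make the Best-of-N evaluation protocol concrete, the sketch below shows how a VL-GenRM can act as a selector over sampled candidate answers. This is a minimal illustration under stated assumptions, not the paper's implementation: the `generate` and `score` callables are hypothetical placeholders for a policy model's sampler and the reward model's judgment call.

```python
# A minimal sketch of Best-of-N sampling with a VL-GenRM as selector.
# Illustrative only: `generate` and `score` are hypothetical callables,
# not interfaces defined in the paper.

from typing import Callable, List


def best_of_n(
    image: bytes,
    query: str,
    generate: Callable[[bytes, str, int], List[str]],  # policy model sampler
    score: Callable[[bytes, str, str], float],         # VL-GenRM judgment score
    n: int = 8,
) -> str:
    """Sample n candidate answers and return the one the VL-GenRM rates highest."""
    candidates = generate(image, query, n)
    scores = [score(image, query, c) for c in candidates]
    # Pick the candidate with the highest reward-model score.
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]
```

Under this protocol, a stronger VL-GenRM should select better candidates more often, which is why Best-of-N downstream accuracy (e.g., on MMMU-Pro) can serve as an external check on judgment quality.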