Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models

Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, Ethan Yeo, Eugenie Lamprecht, Qi Liu, Yuqi Wang, Eric Chen, Deyu Fu, Lei Li, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Mikel Artetxe, Yi Tay

We introduce Vibe-Eval: a new open benchmark and framework for evaluating multimodal chat models. Vibe-Eval consists of 269 visual understanding prompts, including 100 of hard difficulty, complete with gold-standard responses authored by experts. Vibe-Eval is open-ended and challenging, with dual objectives: (i) vibe-checking multimodal chat models on day-to-day tasks and (ii) rigorously testing and probing the capabilities of present frontier models. Notably, more than 50% of the questions in our hard set are answered incorrectly by all frontier models. We explore the nuances of designing, evaluating, and ranking models on ultra-challenging prompts. We also discuss trade-offs between human and automatic evaluation, and show that automatic model evaluation using Reka Core correlates roughly with human judgment. We offer free API access for the purpose of lightweight evaluation and plan to conduct formal human evaluations for public models that perform well on Vibe-Eval's automatic scores. We release the evaluation code and data at https://github.com/reka-ai/reka-vibe-eval

