We introduce Vibe-Eval: a new open benchmark and framework for evaluating multimodal chat models. Vibe-Eval consists of 269 visual understanding prompts, including 100 of hard difficulty, complete with gold-standard responses authored by experts. Vibe-Eval is open-ended and challenging, with dual objectives: (i) vibe checking multimodal chat models for day-to-day tasks and (ii) rigorously testing and probing the capabilities of present frontier models. Notably, more than 50% of the questions in our hard set are answered incorrectly by all frontier models. We explore the nuances of designing, evaluating, and ranking models on ultra-challenging prompts. We also discuss trade-offs between human and automatic evaluation, and show that automatic model evaluation using Reka Core roughly correlates with human judgment. We offer free API access for lightweight evaluation and plan to conduct formal human evaluations for public models that perform well on Vibe-Eval's automatic scores. We release the evaluation code and data at https://github.com/reka-ai/reka-vibe-eval