We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, addressing the highly heterogeneous daily use cases of end users. Our objective is to curate a set of high-quality data samples that covers a diverse and rich range of multimodal tasks, while enabling cost-effective and accurate model evaluation. In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space. Instead of unifying these problems into standard multiple-choice questions (as in MMMU, MMBench, and MMT-Bench), we embrace a wide range of output formats such as numbers, phrases, code, \LaTeX, coordinates, JSON, and free-form text. To accommodate these formats, we developed over 40 metrics. Unlike existing benchmarks, MEGA-Bench offers a fine-grained capability report across multiple dimensions (e.g., application, input type, output format, skill), allowing users to interact with and visualize model capabilities in depth. We evaluate a wide variety of frontier vision-language models on MEGA-Bench to understand their capabilities across these dimensions.
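To make the format-aware evaluation concrete, below is a minimal sketch of how a benchmark might dispatch heterogeneous model outputs to format-specific metrics. All names here (the metric functions, the `METRICS` registry, the tolerance value) are illustrative assumptions for exposition, not MEGA-Bench's actual implementation.

```python
# Illustrative sketch (hypothetical, not MEGA-Bench's code): routing
# heterogeneous task outputs to format-specific scoring functions.
import json

def exact_str(pred: str, ref: str) -> float:
    """Exact match after whitespace/case normalization (e.g., phrase answers)."""
    return float(pred.strip().lower() == ref.strip().lower())

def number_match(pred: str, ref: str, tol: float = 1e-3) -> float:
    """Numeric match within a relative tolerance (e.g., number answers)."""
    try:
        p, r = float(pred), float(ref)
    except ValueError:
        return 0.0
    return float(abs(p - r) <= tol * max(1.0, abs(r)))

def json_field_f1(pred: str, ref: str) -> float:
    """F1 over key-value pairs shared by predicted and reference JSON objects."""
    try:
        p, r = json.loads(pred), json.loads(ref)
    except json.JSONDecodeError:
        return 0.0
    hits = sum(1 for k, v in r.items() if p.get(k) == v)
    if not hits:
        return 0.0
    prec, rec = hits / max(len(p), 1), hits / len(r)
    return 2 * prec * rec / (prec + rec)

# Hypothetical registry: each output format maps to one metric; a real suite
# with 40+ metrics would extend this table (code, LaTeX, coordinates, ...).
METRICS = {
    "number": number_match,
    "phrase": exact_str,
    "json": json_field_f1,
}

def score(task_format: str, pred: str, ref: str) -> float:
    """Look up the metric for a task's declared output format and apply it."""
    return METRICS[task_format](pred, ref)

print(score("number", "3.1416", "3.1415926"))                  # 1.0 (within tolerance)
print(score("json", '{"a": 1, "b": 2}', '{"a": 1, "b": 3}'))   # 0.5
```

The design point this illustrates is that scoring is keyed on a per-task output-format declaration rather than forcing every task into a multiple-choice template.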