In the age of large-scale language models, benchmarks like Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and by expanding the choice set from four to ten options, which lowers the random-guessing baseline from 25% to 10%. MMLU-Pro also eliminates the trivial and noisy questions found in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy of 16% to 33% compared to MMLU, but also demonstrates greater stability under varying prompts: across 24 different prompt styles, the sensitivity of model scores to prompt variations decreased from 4-5% on MMLU to just 2% on MMLU-Pro. Furthermore, we found that models using Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro than with direct answering, in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark that better tracks progress in the field.
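To make the prompt-sensitivity claim concrete, the following is a minimal, hypothetical Python sketch (not the paper's actual evaluation harness) of how such a spread could be computed: score one model under each prompt style and report the range of accuracies. The accuracy values here are toy placeholders; in practice they would come from running the benchmark under each of the 24 prompt styles.

```python
# Hypothetical sketch: prompt sensitivity as the spread of a model's
# accuracy across prompt variants. Toy numbers, not real benchmark results.
from statistics import mean

# One accuracy per prompt style (placeholder values for illustration only).
accuracies = [0.71, 0.72, 0.70, 0.71, 0.72, 0.71]

# Sensitivity = max - min accuracy across prompts, mirroring the abstract's
# 4-5% (MMLU) vs ~2% (MMLU-Pro) figures.
sensitivity = max(accuracies) - min(accuracies)

print(f"mean accuracy: {mean(accuracies):.3f}")
print(f"prompt sensitivity: {sensitivity * 100:.1f}%")
```

A smaller spread under this kind of measurement indicates that reported scores reflect model capability rather than the particular prompt phrasing chosen by the evaluator.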