In the age of large-scale language models, benchmarks such as Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. MMLU-Pro also removes the trivial and noisy questions found in MMLU. Our experimental results show that MMLU-Pro not only raises the difficulty, lowering accuracy by 16% to 33% relative to MMLU, but also demonstrates greater stability under varying prompts. Across 24 different prompt styles, the sensitivity of model scores to prompt variations decreased from 4-5% on MMLU to just 2% on MMLU-Pro. We further found that models using Chain of Thought (CoT) reasoning outperformed direct answering on MMLU-Pro, in stark contrast to the findings on the original MMLU, which indicates that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark for tracking progress in the field.
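Two quantities in the abstract can be made concrete with a small sketch. Expanding from four to ten options lowers the expected accuracy of uniform random guessing from 25% to 10%, and prompt sensitivity can be summarized as the spread of one model's scores across prompt styles. The abstract does not state how sensitivity is computed, so the range and standard deviation below are assumptions, as are the function names and the example scores, which are hypothetical and not results from the paper.

```python
from statistics import mean, pstdev


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on one multiple-choice item."""
    return 1.0 / num_options


def prompt_sensitivity(scores_by_prompt: list[float]) -> dict[str, float]:
    """Summarize how far one model's accuracy moves across prompt styles.

    Assumption: sensitivity is reported here as both the max-min range and the
    population standard deviation; the paper's exact metric may differ.
    """
    return {
        "mean": mean(scores_by_prompt),
        "range": max(scores_by_prompt) - min(scores_by_prompt),
        "std": pstdev(scores_by_prompt),
    }


# Hypothetical accuracies of a single model under four prompt styles.
example_scores = [0.70, 0.72, 0.71, 0.69]
summary = prompt_sensitivity(example_scores)

print(f"chance accuracy, 4 options:  {chance_accuracy(4):.2f}")
print(f"chance accuracy, 10 options: {chance_accuracy(10):.2f}")
print(f"score range across prompts:  {summary['range']:.3f}")
```

A smaller range or standard deviation across prompt styles corresponds to the greater stability the abstract attributes to MMLU-Pro.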