Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Although MMLU is widely adopted, our analysis reveals numerous ground-truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error taxonomy. We then create MMLU-Redux, a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. Using MMLU-Redux, we demonstrate significant discrepancies with the originally reported model performance metrics. Our results strongly advocate for revising MMLU's error-ridden questions to enhance its future utility and reliability as a benchmark. We therefore open up MMLU-Redux for additional annotation at https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux.