Machine Reading Comprehension (MRC) is an essential task for evaluating natural language understanding. Existing MRC datasets primarily assess specific aspects of reading comprehension (RC), and a comprehensive MRC benchmark is still lacking. To fill this gap, we first introduce a novel taxonomy that categorizes the key capabilities required for RC. Based on this taxonomy, we construct MRCEval, an MRC benchmark that leverages advanced Large Language Models (LLMs) as both sample generators and selection judges. MRCEval is a comprehensive, challenging, and accessible benchmark designed to thoroughly assess the RC capabilities of LLMs, covering 13 distinct RC skills with a total of 2.1K high-quality multiple-choice questions. We perform an extensive evaluation of 28 widely used open-source and proprietary models, highlighting that MRC continues to present significant challenges even in the era of LLMs.
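The abstract describes a generate-then-judge construction pipeline in which LLMs both draft candidate samples and filter them. The following is a minimal Python sketch of that two-stage flow, not the authors' implementation: the `generate` and `judge` callables, the `MCQuestion` fields, and the taxonomy skill labels are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MCQuestion:
    """A multiple-choice MRC sample: passage, question, options, gold answer, skill."""
    passage: str
    question: str
    options: List[str]
    answer: str  # e.g. "A"
    skill: str   # one of the taxonomy's RC skills (labels assumed here)

def build_benchmark(
    passages: List[str],
    skills: List[str],
    generate: Callable[[str, str], MCQuestion],  # stand-in for the LLM sample generator
    judge: Callable[[MCQuestion], bool],         # stand-in for the LLM selection judge
) -> List[MCQuestion]:
    """Generate-then-judge: draft one candidate per (passage, skill) pair,
    keep only the candidates the judge accepts."""
    kept = []
    for passage in passages:
        for skill in skills:
            candidate = generate(passage, skill)
            if judge(candidate):
                kept.append(candidate)
    return kept

# Toy stand-ins for the LLM calls, only to make the sketch runnable.
def toy_generate(passage: str, skill: str) -> MCQuestion:
    return MCQuestion(passage, f"[{skill}] What does the passage state?",
                      ["A) ...", "B) ...", "C) ...", "D) ..."], "A", skill)

def toy_judge(q: MCQuestion) -> bool:
    return len(q.options) == 4  # accept only well-formed candidates

bench = build_benchmark(["Some passage."], ["inference"], toy_generate, toy_judge)
print(len(bench))
```

In practice the two callables would wrap prompts to an LLM API; the sketch only illustrates how separating generation from judging yields a filtered pool of high-quality samples per skill.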