When training and evaluating machine reading comprehension models, it is important to work with high-quality datasets that are representative of real-world reading comprehension tasks. This requirement includes, for instance, having questions that are based on texts of different genres and that require making inferences or reflecting on the reading material. In this article we turn our attention to RACE, a dataset of English texts and corresponding multiple-choice questions (MCQs). Each MCQ consists of a question and four alternatives, one of which is the correct answer. RACE was constructed by Chinese teachers of English for human reading comprehension and is widely used as training material for machine reading comprehension models. By construction, RACE should satisfy the aforementioned quality requirements, and the purpose of this article is to check whether it actually does. We provide a detailed analysis of the test set of RACE for high-school students (1045 texts and 3498 corresponding MCQs), including (1) an evaluation of the difficulty of each MCQ and (2) annotations of the relevant pieces of the texts (called "bases") that are used to justify the plausibility of each alternative. A considerable number of MCQs appear not to fulfill basic requirements for this type of reading comprehension task, so we additionally identify the high-quality subset of the evaluated RACE corpus. We also demonstrate that the distribution of the positions of the bases for the alternatives is biased towards certain parts of the texts, which is not necessarily desirable when evaluating MCQ answering and generation models.