Reading comprehension tests are used in a variety of applications, ranging from education to assessing the comprehensibility of simplified texts. However, creating such tests manually and ensuring their quality is difficult and time-consuming. In this paper, we explore how large language models (LLMs) can be used to generate and evaluate multiple-choice reading comprehension items. To this end, we compiled a dataset of German reading comprehension items and developed a new protocol for human and automatic evaluation, including a metric we call text informativity, which is based on guessability and answerability. We then used this protocol and the dataset to evaluate the quality of items generated by Llama 2 and GPT-4. Our results suggest that both models are capable of generating items of acceptable quality in a zero-shot setting, but GPT-4 clearly outperforms Llama 2. We also show that LLMs can be used for automatic evaluation by eliciting item responses from them. In this scenario, evaluation results obtained with GPT-4 were the most similar to those of human annotators. Overall, zero-shot generation with LLMs is a promising approach for generating and evaluating reading comprehension test items, in particular for languages without large amounts of available data.
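The abstract states only that text informativity is based on guessability and answerability; it does not give a formula. As a minimal sketch, we assume here that guessability is the share of correct responses given without access to the text, that answerability is the share of correct responses given with the text, and that informativity is their difference. The names `Response`, `saw_text`, and `text_informativity` are hypothetical and serve illustration only.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Response:
    """One respondent's answer to one multiple-choice item (hypothetical schema)."""
    correct: bool   # True if the respondent chose the correct option
    saw_text: bool  # True if the passage was shown before answering


def text_informativity(responses: Sequence[Response]) -> float:
    """Assumed definition: answerability minus guessability.

    answerability = accuracy of responses given WITH the text
    guessability  = accuracy of responses given WITHOUT the text
    """
    with_text = [r.correct for r in responses if r.saw_text]
    without_text = [r.correct for r in responses if not r.saw_text]
    if not with_text or not without_text:
        raise ValueError("need responses both with and without the text")
    answerability = mean(with_text)      # booleans average as 0/1
    guessability = mean(without_text)
    return answerability - guessability


# Toy data, invented for illustration: 3 of 4 correct with the text,
# 1 of 4 correct without it.
toy = (
    [Response(True, True)] * 3 + [Response(False, True)]
    + [Response(True, False)] + [Response(False, False)] * 3
)
score = text_informativity(toy)
```

Under this assumed definition, a value near zero would indicate that reading the text barely helps respondents answer, either because the items can be guessed without it or because they cannot be answered even with it. The same function could be applied to responses elicited from human annotators or from an LLM.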