Reading comprehension tests are used in a variety of applications, ranging from education to assessing the comprehensibility of simplified texts. However, creating such tests manually and ensuring their quality is difficult and time-consuming. In this paper, we explore how large language models (LLMs) can be used to generate and evaluate multiple-choice reading comprehension items. To this end, we compiled a dataset of German reading comprehension items and developed a new protocol for human and automatic evaluation, including a metric we call text informativity, which is based on guessability and answerability. We then used this protocol and the dataset to evaluate the quality of items generated by Llama 2 and GPT-4. Our results suggest that both models are capable of generating items of acceptable quality in a zero-shot setting, but GPT-4 clearly outperforms Llama 2. We also show that LLMs can be used for automatic evaluation by eliciting item responses from them. In this scenario, evaluation results with GPT-4 were the most similar to those of human annotators. Overall, zero-shot generation with LLMs is a promising approach for generating and evaluating reading comprehension test items, particularly for languages without large amounts of available data.
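As a rough illustration of the metric, one plausible formalization (a sketch under our own assumptions, not necessarily the exact definition given later in the paper) treats guessability G as the proportion of correct responses elicited without showing the text, answerability A as the proportion of correct responses with the text available, and text informativity as the gain from seeing the text:

\[
\mathrm{informativity} = A - G, \qquad
A = \frac{\#\,\text{correct responses with text}}{\#\,\text{responses with text}}, \qquad
G = \frac{\#\,\text{correct responses without text}}{\#\,\text{responses without text}}
\]

Under this reading, an item that can be answered correctly only by readers who have seen the text scores high, while an item guessable from the options alone scores near zero.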