This study explores the effectiveness of Large Language Models (LLMs) for Automatic Question Generation in educational settings. Three LLMs are compared on their ability to generate questions from university slide text without fine-tuning. Questions were produced in a two-step pipeline: first, answer phrases were extracted from the slides using Llama 2-Chat 13B; then each of the three models generated a question for each answer. To assess whether the generated questions are suitable for educational applications, a survey was conducted with 46 students, who evaluated a total of 246 questions on five metrics: clarity, relevance, difficulty, slide relation, and question-answer alignment. The results indicate that GPT-3.5 and Llama 2-Chat 13B outperform Flan T5 XXL by a small margin, particularly in clarity and question-answer alignment; GPT-3.5 is especially effective at tailoring questions to the given answers. The contribution of this work is an analysis of the capacity of LLMs for Automatic Question Generation in education.