Learnersourcing involves students generating and sharing learning resources with their peers. When learnersourcing multiple-choice questions, writing an explanation for each generated question is a crucial step, as it fosters a deeper understanding of the related concepts. However, students often find it difficult to craft effective explanations because of limited subject understanding and a tendency to merely restate the question stem, distractors, and correct answer. To scaffold this task, we propose a self-reinforcement large-language-model framework that generates and evaluates explanations automatically. The framework comprises three modules, which generate student-aligned explanations, evaluate these explanations to ensure their quality, and iteratively enhance them. If an explanation's evaluation score falls below a defined threshold, the framework refines and reassesses it. Importantly, the framework emulates the manner in which students at the relevant grade level compose explanations. For evaluation, a human subject-matter expert compared student-written explanations with those produced by the open-source large language model Vicuna-13B, by a version of Vicuna-13B fine-tuned using our method, and by GPT-4. Compared with the other large language models, GPT-4 exhibited a higher level of creativity in generating explanations. The human expert also ranked explanations generated by GPT-4 higher than both those created by the other models and the original student-created explanations. Our findings represent a significant advancement in enriching the learnersourcing experience for students and enhancing the capabilities of large language models in educational applications.
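The generate, evaluate, and refine cycle described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the framework's actual implementation: the function name `explain_with_refinement`, the `Question` type, the 0-to-1 score scale, the `threshold` default, and the `max_rounds` cap are all hypothetical, and the three `toy_*` functions are deterministic stand-ins for the language-model modules.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Question:
    """A multiple-choice question (hypothetical container for this sketch)."""
    stem: str
    options: list[str]
    answer: str


def explain_with_refinement(
    question: Question,
    generate: Callable[[Question], str],
    evaluate: Callable[[Question, str], float],
    refine: Callable[[Question, str, float], str],
    threshold: float = 0.7,  # assumed value; the abstract only says "a defined threshold"
    max_rounds: int = 3,     # assumed cap so the loop always terminates
) -> tuple[str, float, int]:
    """Generate an explanation, then refine it while its score is below threshold."""
    explanation = generate(question)
    score = evaluate(question, explanation)
    rounds = 0
    while score < threshold and rounds < max_rounds:
        explanation = refine(question, explanation, score)
        score = evaluate(question, explanation)  # reassess after each refinement
        rounds += 1
    return explanation, score, rounds


# Toy stand-ins for the three modules. A real system would call a language
# model here; these exist only so the sketch runs without external services.
def toy_generate(q: Question) -> str:
    return f"The answer is {q.answer}."


def toy_evaluate(q: Question, explanation: str) -> float:
    # Crude proxy: longer explanations score higher, capped at 1.0.
    return min(1.0, len(explanation.split()) / 10)


def toy_refine(q: Question, explanation: str, score: float) -> str:
    return explanation + " It fits the stem and the distractors do not."


if __name__ == "__main__":
    q = Question(stem="What is 2 + 3?", options=["4", "5", "6"], answer="5")
    text, score, rounds = explain_with_refinement(
        q, toy_generate, toy_evaluate, toy_refine
    )
    print(rounds, score, text)
```

Passing the three modules as callables keeps the control flow separate from any particular model, so the same loop could wrap Vicuna-13B, a fine-tuned variant, or GPT-4 without modification.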