Large Language Models (LLMs) such as ChatGPT have demonstrated remarkable performance across various tasks and have garnered significant attention from both researchers and practitioners. However, in an educational context, we still observe a performance gap when LLMs are used to generate distractors (i.e., plausible yet incorrect answers) for multiple-choice questions (MCQs). In this study, we propose a strategy for guiding LLMs such as ChatGPT in generating relevant distractors by prompting them with question items automatically retrieved from a question bank as well-chosen in-context examples. We evaluate our LLM-based solutions using a quantitative assessment on an existing test set, as well as through quality annotations by human experts, i.e., teachers. We found that, on average, 53% of the generated distractors presented to the teachers were rated as high-quality, i.e., suitable for immediate use as is, outperforming the state-of-the-art model. We also show the gains of our approach in generating high-quality distractors by comparing it with a zero-shot ChatGPT and a few-shot ChatGPT prompted with static examples.
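The retrieval-based prompting idea described above can be illustrated with a minimal sketch. The abstract does not specify the retriever, the similarity measure, or the prompt template, so everything here is an assumption made for illustration: the `QuestionItem` type, the bag-of-words cosine similarity used as a stand-in retriever, the helper names `retrieve_examples` and `build_prompt`, and the prompt wording are all hypothetical. No LLM is called; the sketch only shows how examples retrieved from a question bank could be assembled into a few-shot prompt.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class QuestionItem:
    """Hypothetical question-bank entry (not the paper's data format)."""
    question: str
    answer: str
    distractors: list[str]


def _bag_of_words(text: str) -> Counter:
    # Lowercased word counts; a deliberately simple text representation.
    return Counter(re.findall(r"\w+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_examples(question: str, bank: list[QuestionItem], k: int = 2) -> list[QuestionItem]:
    """Return the k bank items whose question text is most similar to `question`.

    Assumption: lexical similarity stands in for whatever retriever the
    authors actually use.
    """
    query = _bag_of_words(question)
    ranked = sorted(
        bank,
        key=lambda item: _cosine(query, _bag_of_words(item.question)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, answer: str, examples: list[QuestionItem], n_distractors: int = 3) -> str:
    """Assemble a few-shot prompt from retrieved examples (hypothetical template)."""
    parts = [
        f"Generate {n_distractors} plausible but incorrect answers "
        "(distractors) for the final multiple-choice question."
    ]
    for ex in examples:
        parts.append(
            f"Question: {ex.question}\nAnswer: {ex.answer}\n"
            f"Distractors: {'; '.join(ex.distractors)}"
        )
    # The target question comes last, with the distractor slot left open.
    parts.append(f"Question: {question}\nAnswer: {answer}\nDistractors:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    bank = [
        QuestionItem("What is the capital of France?", "Paris", ["Lyon", "Marseille", "Nice"]),
        QuestionItem("Which gas do plants absorb during photosynthesis?", "Carbon dioxide", ["Oxygen", "Nitrogen", "Hydrogen"]),
        QuestionItem("What is the capital of Germany?", "Berlin", ["Munich", "Hamburg", "Frankfurt"]),
    ]
    target = "What is the capital of Italy?"
    print(build_prompt(target, "Rome", retrieve_examples(target, bank, k=2)))
```

In this sketch, the static few-shot baseline mentioned in the abstract would correspond to passing the same fixed `examples` list for every target question, and the zero-shot baseline to passing an empty list.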