Multiple-choice questions with item-writing flaws can negatively impact student learning and skew analytics. These flaws are often present in student-generated questions, making it difficult to assess their quality and suitability for classroom use. Existing methods for evaluating multiple-choice questions often focus on machine readability metrics, without considering the questions' intended use within course materials or their pedagogical implications. In this study, we compared the performance of a rule-based method we developed with that of a machine-learning-based method utilizing GPT-4 on the task of automatically assessing multiple-choice questions against 19 common item-writing flaws. Analyzing 200 student-generated questions from four different subject areas, we found that the rule-based method correctly detected 91% of the flaws identified by human annotators, compared with 79% for GPT-4. We demonstrated the effectiveness of both methods in identifying common item-writing flaws in student-generated questions across different subject areas. The rule-based method can accurately and efficiently evaluate multiple-choice questions from multiple domains, outperforming GPT-4 and going beyond existing metrics that do not account for the educational use of such questions. Finally, we discuss the potential for using these automated methods to improve the quality of questions based on the identified flaws.
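To make the idea of rule-based flaw detection concrete, the following is a minimal sketch and not the method evaluated in the study. The three rules (`all_of_the_above`, `none_of_the_above`, `negative_stem`) are illustrative assumptions covering flaws commonly listed in item-writing guidelines; the abstract does not specify how the 19 flaws are checked. The `MCQ` class, the `RULES` table, and the `detection_rate` helper are likewise hypothetical names. `detection_rate` computes the share of human-annotated flaws that a method also flags, which is one plausible reading of the reported 91% and 79% figures.

```python
import re
from dataclasses import dataclass


@dataclass
class MCQ:
    """A multiple-choice question: a stem plus its answer options."""
    stem: str
    options: list[str]


# Illustrative rules only (assumed, not taken from the paper).
def all_of_the_above(q: MCQ) -> bool:
    return any(re.search(r"\ball of the above\b", o, re.I) for o in q.options)


def none_of_the_above(q: MCQ) -> bool:
    return any(re.search(r"\bnone of the above\b", o, re.I) for o in q.options)


def negative_stem(q: MCQ) -> bool:
    # Flags stems containing a negation keyword such as "not" or "except".
    return re.search(r"\b(not|except)\b", q.stem, re.I) is not None


RULES = {
    "all_of_the_above": all_of_the_above,
    "none_of_the_above": none_of_the_above,
    "negative_stem": negative_stem,
}


def detect_flaws(q: MCQ) -> set[str]:
    """Return the names of all rules that fire on the question."""
    return {name for name, rule in RULES.items() if rule(q)}


def detection_rate(human: list[set[str]], predicted: list[set[str]]) -> float:
    """Fraction of human-annotated flaws that were also flagged automatically."""
    total = sum(len(h) for h in human)
    hits = sum(len(h & p) for h, p in zip(human, predicted))
    return hits / total if total else 0.0


if __name__ == "__main__":
    q = MCQ(
        stem="Which of the following is NOT a mammal?",
        options=["Dog", "Cat", "Shark", "All of the above"],
    )
    print(sorted(detect_flaws(q)))
```

A keyword rule like `negative_stem` is deliberately crude: it will also fire on stems where "not" is harmless, which is the kind of trade-off any real rule set would need to tune against human annotations.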