Multiple-choice (MC) tests are an efficient way to assess English learners. During exam curation, it is useful for test creators to rank candidate MC questions by difficulty. Difficulty is typically determined by having human test takers trial the questions in a pretesting stage, but this is expensive and does not scale. We therefore explore automated approaches to ranking MC questions by difficulty. Because little data is available for explicitly training a system to predict difficulty scores, we compare task transfer and zero-shot approaches: task transfer adapts level classification and reading comprehension systems for difficulty ranking, while zero-shot prompting of instruction-finetuned language models contrasts absolute assessment with comparative assessment. We find that level classification transfers better than reading comprehension. Moreover, zero-shot comparative assessment ranks question difficulty more effectively than absolute assessment and even the task transfer approaches, achieving a Spearman's correlation of 40.4%. Combining the systems further boosts the correlation.
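To make the comparative setup concrete, the following is a minimal sketch of how pairwise "which question is harder?" judgements could be turned into a difficulty ranking and scored with Spearman's correlation. It is an illustration under stated assumptions, not the system evaluated here: `toy_judge` is a hypothetical stand-in for a zero-shot language model call, the win-fraction aggregation is one simple choice among several, and the questions and reference difficulties are invented for the example.

```python
from itertools import permutations
from typing import Callable, Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's correlation: Pearson correlation of the rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def comparative_scores(
    items: Sequence[str], is_harder: Callable[[str, str], bool]
) -> list[float]:
    """Score each item by the fraction of comparisons in which it is judged harder.

    Both orderings of every pair are queried, which is a common way to
    reduce sensitivity to the position of an item in the prompt.
    """
    wins = [0] * len(items)
    for i, j in permutations(range(len(items)), 2):
        if is_harder(items[i], items[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    comparisons_per_item = 2 * (len(items) - 1)
    return [w / comparisons_per_item for w in wins]


def toy_judge(a: str, b: str) -> bool:
    """Hypothetical placeholder for a model call: longer question = harder."""
    return len(a.split()) > len(b.split())


# Invented questions and reference difficulties, for illustration only.
questions = [
    "What is this?",
    "Which word best completes the sentence?",
    "Which inference is most strongly supported by the final paragraph of the passage?",
]
reference_difficulty = [0.2, 0.5, 0.9]

scores = comparative_scores(questions, toy_judge)
rho = spearman(scores, reference_difficulty)
```

In practice `toy_judge` would be replaced by a prompt asking the model which of two questions is more difficult, and `reference_difficulty` by difficulty estimates obtained from human pretesting. A correlation reported as 40.4% corresponds to `rho == 0.404` on this scale.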