The difficulty of multiple-choice questions (MCQs) is a crucial factor in educational assessments. Predicting MCQ difficulty is challenging since it requires understanding both the complexity of reaching the correct option and the plausibility of distractors, i.e., incorrect options. In this paper, we propose a novel two-stage method to predict the difficulty of MCQs. First, to better estimate the complexity of each MCQ, we use large language models (LLMs) to augment the reasoning steps required to reach each option. We use not just the MCQ itself but also these reasoning steps as input to predict difficulty. Second, to capture the plausibility of distractors, we sample knowledge levels from a distribution to account for variation among students responding to the MCQ. This setup, inspired by item response theory (IRT), enables us to estimate the likelihood that students select each option, both correct and incorrect. We align these predicted likelihoods with their ground-truth values using a Kullback-Leibler (KL) divergence-based regularization objective, and use the estimated likelihoods to predict MCQ difficulty. We evaluate our method on two real-world \emph{math} MCQ and response datasets with ground-truth difficulty values estimated using IRT. Experimental results show that our method outperforms all baselines, with up to a 28.3\% reduction in mean squared error and a 34.6\% improvement in the coefficient of determination. We also qualitatively discuss how our method achieves higher accuracy in predicting MCQ difficulty.
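The IRT-inspired idea above can be illustrated with a minimal sketch: sample student knowledge levels from a distribution, map each level to a probability over the options, average to get expected option-selection likelihoods, and measure their divergence from a ground-truth response distribution with KL. All function names, the softmax parameterization, and the per-option scores here are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def option_probabilities(theta, option_scores):
    # Illustrative choice: a softmax over per-option scores scaled by the
    # student's knowledge level theta; higher theta concentrates mass on
    # the highest-scoring (correct) option.
    logits = theta * option_scores
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two discrete distributions, clipped to avoid log(0).
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
# Sample knowledge levels to model variation among responding students.
thetas = rng.normal(loc=0.0, scale=1.0, size=1000)

# Hypothetical per-option scores for one MCQ; correct option listed first.
option_scores = np.array([1.2, 0.4, 0.3, 0.1])

# Expected selection likelihood per option, averaged over sampled students.
probs = np.mean([option_probabilities(t, option_scores) for t in thetas], axis=0)

# A KL-based term of this shape could align predicted likelihoods with a
# ground-truth option-response distribution (values below are made up).
ground_truth = np.array([0.55, 0.20, 0.15, 0.10])
kl_term = kl_divergence(ground_truth, probs)
```

In a sketch like this, the estimated difficulty would then be driven by quantities such as `1 - probs[0]` (the expected rate of missing the correct option), with `kl_term` serving as the regularization signal during training.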