Estimating Item Difficulty with Large Language Models as Experts

Accurate estimates of item difficulty are essential for valid assessment and effective adaptive learning. For newly created tasks, however, response data are typically unavailable. Pretesting and expert judgement can be costly and slow, while machine learning methods often require large labelled training datasets. Recent work suggests that large language models (LLMs) may help, but there is limited evidence on how elicitation procedures and prompt configurations affect their ability to emulate experts in difficulty estimation. This study addresses this gap by evaluating three off-the-shelf LLMs as difficulty raters for newly created items without access to response data. Using an item bank from an online learning system, the study examined six domains of primary-school mathematics, with empirical difficulty estimates serving as the reference standard. The study used a full factorial design crossing three factors: judgement format (absolute vs pairwise), decision type (hard decisions vs token-probability-based estimates), and prompting strategy (zero-shot vs few-shot). LLM-derived difficulty estimates were compared with empirical difficulties using Spearman rank correlations. Across domains, LLM-based estimates exhibited moderate to strong positive correlations with empirical item difficulties. For simpler arithmetic tasks, some configurations approached the upper end of the accuracy range reported for human experts in previous research. Without additional refinements, pairwise comparison consistently outperformed absolute judgement. However, when token-level probabilities were incorporated and examples of items with known empirical difficulty were provided, absolute judgement also reached moderate-to-high alignment with empirical difficulties. The study positions LLMs as a promising tool for initial item calibration and offers insights into effective workflow configuration.
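
The abstract does not spell out how the judgements were turned into scores, so the following Python sketch is only an illustration under stated assumptions. It assumes that a token-probability-based absolute estimate is the probability-weighted mean of the rating tokens, that pairwise judgements are aggregated into a simple win rate, and that agreement is then measured with a Spearman rank correlation. The function names (`expected_rating`, `win_rates`, `spearman`) and all numbers are hypothetical and are not taken from the study.

```python
from itertools import combinations
from statistics import mean


def expected_rating(token_probs):
    """Probability-weighted mean rating from a {rating: probability} mapping.

    Assumption: the model's probabilities over rating tokens are available
    and are renormalised over the ratings that appear in the mapping.
    """
    total = sum(token_probs.values())
    return sum(r * p for r, p in token_probs.items()) / total


def win_rates(items, prefer_harder):
    """Score each item by its mean probability of being judged the harder one.

    prefer_harder(a, b) returns the probability that a is harder than b
    (1.0 or 0.0 for hard decisions, a value in between for soft ones).
    """
    wins = {i: 0.0 for i in items}
    for a, b in combinations(items, 2):
        p = prefer_harder(a, b)
        wins[a] += p
        wins[b] += 1.0 - p
    n_opponents = len(items) - 1
    return {i: w / n_opponents for i, w in wins.items()}


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy data only: probabilities and reference difficulties are invented.
rating_probs = [
    {1: 0.7, 2: 0.2, 3: 0.1},
    {2: 0.3, 3: 0.5, 4: 0.2},
    {3: 0.2, 4: 0.5, 5: 0.3},
]
empirical = [-1.2, 0.1, 0.9]

absolute_scores = [expected_rating(p) for p in rating_probs]
print(spearman(absolute_scores, empirical))
```

In a real workflow `prefer_harder` would wrap an LLM call comparing two items; here it is left as a parameter so that the aggregation logic can be inspected on its own.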
