Estimating item difficulty through field testing is often resource-intensive and time-consuming. There is therefore strong motivation to develop methods that can predict item difficulty at scale from item content alone. Large Language Models (LLMs) represent a new frontier for this goal. The present research examines the feasibility of using an LLM to predict item difficulty for K-5 mathematics and reading assessment items (N = 5170). Two estimation approaches were implemented: (a) a direct estimation method that prompted the LLM to assign a single difficulty rating to each item, and (b) a feature-based strategy in which the LLM extracted multiple cognitive and linguistic features that were then used in ensemble tree-based models (random forests and gradient boosting) to predict difficulty. Overall, direct LLM estimates showed moderate to strong correlations with true item difficulties, but their accuracy varied by grade level and was often weakest in the early grades. In contrast, the feature-based method yielded stronger predictive accuracy, with correlations as high as r = 0.87 and lower prediction error than both direct LLM estimates and baseline regressors. These findings highlight the promise of LLMs for streamlining item development and reducing reliance on extensive field testing, and they underscore the importance of structured feature extraction. We conclude with a seven-step workflow for testing professionals who want to implement a similar item difficulty estimation approach with their own item pools.
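To make the feature-based strategy concrete, the following is a minimal sketch of the second approach: fitting random forest and gradient boosting regressors on LLM-extracted item features to predict field-tested difficulty. The feature matrix, target values, train/test split, and evaluation metrics here are illustrative assumptions standing in for the paper's actual data and pipeline, not the authors' implementation.

```python
# Sketch of the feature-based difficulty-prediction approach, assuming
# LLM-extracted item features are already available as a numeric matrix.
# The synthetic data below is a stand-in for real items and difficulties.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Hypothetical LLM-extracted cognitive/linguistic features (one row per item),
# e.g., reading load, number of solution steps, vocabulary rarity.
n_items, n_features = 500, 8
X = rng.normal(size=(n_items, n_features))
# Stand-in for field-tested difficulty estimates (e.g., IRT b-parameters).
y = X @ rng.normal(size=n_features) + rng.normal(scale=0.5, size=n_items)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit both ensemble tree-based models named in the abstract and report the
# correlation and RMSE of predictions against held-out difficulties.
for model in (RandomForestRegressor(n_estimators=500, random_state=0),
              GradientBoostingRegressor(random_state=0)):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    r, _ = pearsonr(y_test, pred)
    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(f"{type(model).__name__}: r = {r:.2f}, RMSE = {rmse:.2f}")
```

In an actual application, each row of X would come from prompting the LLM to score an item on a fixed rubric of features, and y would be the operational difficulty estimates from prior field testing.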