Item response theory (IRT) is a class of interpretable factor models that are widely used in computerized adaptive tests (CATs), such as language proficiency tests. Traditionally, these models are fit with parametric mixed-effects models of the probability that a test taker answers a test item (i.e., question) correctly. Neural-network extensions of these models, such as BERT-IRT, require specialized architectures and parameter tuning. We propose a multistage fitting procedure that is compatible with out-of-the-box Automated Machine Learning (AutoML) tools. It is based on a Monte Carlo EM (MCEM) outer loop with a two-stage inner loop, which trains a non-parametric AutoML-grade model on item features and then an item-specific parametric model. This greatly accelerates the modeling workflow for scoring tests. We demonstrate its effectiveness by applying it to the Duolingo English Test, a high-stakes, online English proficiency test. We show that the resulting model is typically better calibrated, achieves better predictive performance, and produces more accurate scores than existing methods (non-explanatory IRT models and explanatory IRT models such as BERT-IRT). Along the way, we provide a brief survey of machine learning methods for calibrating item parameters in CATs.
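To make the procedure concrete, here is a minimal Python sketch of the MCEM outer loop with the two-stage inner loop, run on synthetic data. It is not the authors' implementation: the Rasch response model, the Metropolis sampler in the E-step, scikit-learn's GradientBoostingClassifier standing in for an out-of-the-box AutoML model, and the per-item Newton refinement in the second inner-loop stage are all simplifying assumptions made for illustration.

```python
# Illustrative sketch only: Rasch model, Metropolis E-step, and a gradient
# boosting classifier as an AutoML stand-in are assumptions, not the paper's
# actual pipeline.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic data: N test takers, J items, K observed item features.
N, J, K = 200, 50, 5
item_feats = rng.normal(size=(J, K))
true_b = 0.5 * item_feats @ rng.normal(size=K)   # difficulties driven by features
true_theta = rng.normal(size=N)                  # latent abilities
P = 1.0 / (1.0 + np.exp(-(true_theta[:, None] - true_b[None, :])))
Y = rng.binomial(1, P)                           # correct/incorrect responses

def loglik(theta, b):
    """Per-test-taker Rasch log-likelihood of the observed responses."""
    logits = theta[:, None] - b[None, :]
    return (Y * logits - np.logaddexp(0.0, logits)).sum(axis=1)

b_hat = np.zeros(J)    # current item difficulty estimates
theta = np.zeros(N)    # current ability sample

for em_iter in range(10):
    # E-step (Monte Carlo): Metropolis samples from theta | Y, b_hat
    # under a standard normal prior on ability.
    draws = []
    for _ in range(50):
        prop = theta + rng.normal(scale=0.5, size=N)
        log_ratio = (loglik(prop, b_hat) - 0.5 * prop**2
                     - loglik(theta, b_hat) + 0.5 * theta**2)
        theta = np.where(np.log(rng.uniform(size=N)) < log_ratio, prop, theta)
        draws.append(theta.copy())
    theta_mc = np.concatenate(draws[40:])        # keep post-burn-in draws
    n_mc = theta_mc.shape[0]
    y_mc = np.tile(Y, (n_mc // N, 1))            # responses aligned with draws

    # Inner-loop stage 1: non-parametric model of P(correct | ability, item
    # features), pooling information across items through their features.
    X = np.hstack([np.repeat(theta_mc, J)[:, None],
                   np.tile(item_feats, (n_mc, 1))])
    automl = GradientBoostingClassifier(n_estimators=50, max_depth=2)
    automl.fit(X, y_mc.ravel())
    p1 = automl.predict_proba(X)[:, 1].reshape(n_mc, J)

    # Inner-loop stage 2: item-specific parametric refit, initialized at the
    # stage-1 implied difficulty (logit p = theta - b under the Rasch model).
    b_hat = theta_mc.mean() - np.log(p1 / (1.0 - p1)).mean(axis=0)
    for _ in range(5):                           # per-item Newton refinement
        p = 1.0 / (1.0 + np.exp(-(theta_mc[:, None] - b_hat[None, :])))
        b_hat += (p - y_mc).sum(axis=0) / (p * (1.0 - p)).sum(axis=0)

print("corr(true_b, b_hat) =", np.corrcoef(true_b, b_hat)[0, 1])
```

The split mirrors the abstract's two-stage inner loop: the first stage lets a generic non-parametric learner share statistical strength across items via their features, while the second stage restores per-item flexibility with a small parametric model, so no specialized neural architecture is needed.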