Dependencies in Item-Adaptive CAT Data and Differential Item Functioning Detection: A Multilevel Framework

Differential item functioning (DIF) detection is an important yet understudied problem in computerized adaptive testing (CAT). In this article, we propose a two-level logistic model that improves DIF detection in CAT by explicitly accounting for nuisance effects arising from CAT-induced structural dependency. First, we argue that adaptive item selection induces systematic dependencies among examinees and items through provisional ability estimates, whereas traditional single-level DIF methods assume independent observations and may therefore yield misleading results in CAT settings. Then, using a numerical example and Monte Carlo simulations, we compare the proposed two-level model with competing single-level models under various CAT conditions, manipulating test length, exposure control, ability estimator, DIF type, and DIF prevalence. For each model, we report item-level Type I error and statistical power conditional on joint model convergence. The proposed two-level model shows better control of spurious DIF and competitive power relative to single-level models, particularly with shorter tests and lower exposure rates. However, model convergence varied systematically across simulated conditions, indicating that inferential accuracy and convergence reliability are intertwined in complex CAT DIF settings. This study underscores both the promise of multilevel DIF modeling in CAT and the need for future research to evaluate convergence and inferential performance jointly when assessing DIF models.
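To make the reporting convention concrete, the following is a minimal sketch of how item-level Type I error and power conditional on joint model convergence might be tabulated from simulation output. It is an illustration under stated assumptions, not the authors' implementation: the record layout (`ItemResult`), the function name `conditional_rates`, the model labels, and the reading of "joint convergence" as "every compared model converged in that replication" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ItemResult:
    """One item's DIF test outcome for one model in one replication (hypothetical layout)."""
    replication: int
    has_dif: bool   # True if DIF was simulated for this item
    flagged: bool   # True if the model flagged the item as showing DIF


def conditional_rates(
    results: List[ItemResult],
    converged: Dict[int, Dict[str, bool]],
) -> Tuple[Optional[float], Optional[float], float]:
    """Return (type_i_error, power, joint_convergence_rate).

    Assumption: a replication counts as jointly converged only when every
    model listed for it converged. Rates are computed over items from
    jointly converged replications only; a rate is None when no items
    of the relevant kind remain.
    """
    joint = {rep: all(flags.values()) for rep, flags in converged.items()}
    kept = [r for r in results if joint.get(r.replication, False)]

    null_items = [r for r in kept if not r.has_dif]  # DIF-free items
    dif_items = [r for r in kept if r.has_dif]       # items simulated with DIF

    # Type I error: share of DIF-free items that were (wrongly) flagged.
    type_i = sum(r.flagged for r in null_items) / len(null_items) if null_items else None
    # Power: share of true DIF items that were (correctly) flagged.
    power = sum(r.flagged for r in dif_items) / len(dif_items) if dif_items else None
    conv_rate = sum(joint.values()) / len(joint) if joint else 0.0
    return type_i, power, conv_rate


# Toy usage: replication 1 is dropped because one model failed to converge.
demo_converged = {
    0: {"single_level": True, "two_level": True},
    1: {"single_level": True, "two_level": False},
}
demo_results = [
    ItemResult(0, has_dif=False, flagged=False),
    ItemResult(0, has_dif=False, flagged=True),
    ItemResult(0, has_dif=True, flagged=True),
    ItemResult(1, has_dif=False, flagged=True),
]
print(conditional_rates(demo_results, demo_converged))
```

Returning the joint convergence rate alongside the two error rates keeps the conditioning visible: a favorable Type I error rate computed from a small share of converged replications should be read differently from one computed from nearly all of them.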
