In the item response theory (IRT) literature, differential test functioning (DTF) has been conceptualized in terms of how the test response function differs across groups of respondents. This paper presents an alternative approach to DTF that focuses on how the distribution of the latent trait differs across groups, which is referred to as impact. It is proposed to evaluate DTF by comparing two estimates of impact: a naive estimate that aggregates over all test items, and a robust alternative that down-weights items exhibiting differential item functioning (DIF). Taking this approach, the paper makes three contributions. First, it is shown that the difference between the naive and robust estimands provides a convenient effect size for quantifying the extent to which DIF affects conclusions about impact (as opposed to test scores). Second, it is shown how to construct a robust estimator that yields consistent estimates of impact whenever fewer than half of the items exhibit DIF. Third, a relatively general-purpose Wald test of the difference between two estimates of impact is developed. Simulations and an empirical example from physics education are used to show how the proposed effect size and test statistic perform with the proposed robust estimator of impact, as well as with estimators that arise from conventional item-by-item tests of DIF.
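The comparison of a naive and a robust estimate of impact can be sketched in a few lines. The sketch below is illustrative only and is not the estimator developed in the paper: it assumes that each item supplies its own estimate of the group difference (`item_shifts`, a hypothetical name with made-up values), uses the unweighted mean as the naive estimate, and substitutes the median as a simple stand-in for a robust estimate, since the median also tolerates contamination in fewer than half of the items. The helper `wald_z` is likewise a generic Wald statistic for the difference between two estimates, with the variances and covariance taken as given inputs rather than derived from an IRT model.

```python
import math
import statistics


def naive_impact(item_shifts):
    """Naive estimate: unweighted mean of the item-level group differences."""
    return statistics.fmean(item_shifts)


def robust_impact(item_shifts):
    """Stand-in robust estimate (an assumption, not the paper's estimator):
    the median of the item-level group differences."""
    return statistics.median(item_shifts)


def wald_z(est_a, est_b, var_a, var_b, cov_ab=0.0):
    """Generic Wald z-statistic for H0: est_a - est_b = 0.

    The variance of the difference is var_a + var_b - 2 * cov_ab.
    """
    var_diff = var_a + var_b - 2.0 * cov_ab
    if var_diff <= 0:
        raise ValueError("variance of the difference must be positive")
    return (est_a - est_b) / math.sqrt(var_diff)


# Hypothetical item-level estimates: four items agree, one item shows DIF.
item_shifts = [0.5, 0.5, 0.5, 0.5, 1.5]

naive = naive_impact(item_shifts)
robust = robust_impact(item_shifts)

# Effect size in the sense described above: naive minus robust.
effect_size = naive - robust

# Made-up sampling variances and covariance, for illustration only.
z = wald_z(naive, robust, var_a=0.02, var_b=0.03, cov_ab=0.01)
print(f"naive={naive:.3f} robust={robust:.3f} effect={effect_size:.3f} z={z:.3f}")
```

In this toy setting the single aberrant item pulls the naive estimate away from the value shared by the other four items, while the median is unaffected, so the difference between the two isolates the contribution of that item to the estimated impact.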