This paper addresses the situation in which treatment effects are reported using educational or psychological outcome measures composed of multiple questions or "items." A distinction is made between a treatment effect on the construct being measured, which is referred to as impact, and item-specific treatment effects that are not due to impact, which are referred to as differential item functioning (DIF). By definition, impact generalizes to other measures of the same construct (i.e., measures that use different items), while DIF depends on the specific items that make up the outcome measure. To distinguish these two cases, two estimators of impact are compared: an estimator that naively aggregates over items, and a less efficient estimator that is highly robust to DIF. The null hypothesis that both are consistent estimators of the true treatment impact leads to a Hausman-like specification test of whether the naive estimate is affected by item-level variation that would not be expected to generalize beyond the specific outcome measure used. The performance of the test is illustrated with simulation studies and a re-analysis of 34 item-level datasets from 22 randomized evaluations of educational interventions. In the empirical example, the dependence of reported effect sizes on the type of outcome measure (researcher-developed or independently developed) was substantially reduced after accounting for DIF. Implications for the ongoing debate about the role of researcher-developed assessments in education sciences are discussed.
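The logic of a Hausman-type comparison can be sketched for the scalar case. The snippet below is a minimal illustration of the classical form of the test, not the specific estimators or test statistic developed in the paper: it assumes that point estimates and sampling variances for a naive and a robust impact estimator are already available, and that the naive estimator is efficient under the null, so that the variance of the difference equals the difference of the variances. The function name, argument names, and numbers are hypothetical.

```python
import math


def hausman_like_test(naive_est, naive_var, robust_est, robust_var):
    """Scalar Hausman-type test comparing two estimates of the same quantity.

    Assumption (classical Hausman setup, not taken from the paper): the naive
    estimator is efficient under the null, so
    Var(robust - naive) = Var(robust) - Var(naive).
    """
    diff = robust_est - naive_est
    var_diff = robust_var - naive_var
    if var_diff <= 0:
        # The variance-difference form is only usable when the robust
        # estimator is strictly less precise than the naive one.
        raise ValueError("robust_var must exceed naive_var")
    stat = diff ** 2 / var_diff
    # Under the null the statistic is chi-square with 1 degree of freedom;
    # its upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Made-up numbers for illustration only (not results from the paper).
stat, p_value = hausman_like_test(
    naive_est=0.30, naive_var=0.004, robust_est=0.18, robust_var=0.010
)
print(f"statistic = {stat:.3f}, p-value = {p_value:.3f}")
```

A small p-value would suggest that the two estimators disagree by more than sampling error allows, which in the setting described above would point to item-level effects (DIF) contaminating the naive estimate.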