Many researchers have identified distribution shift as a likely contributor to the reproducibility crisis in the behavioral and biomedical sciences. The idea is that if treatment effects vary with individual characteristics and experimental contexts, then studies conducted in different populations will estimate different average effects. This paper uses ``generalizability'' methods to quantify how much of the effect size discrepancy between an original study and its replication can be explained by distribution shift in observed unit-level characteristics. More specifically, we decompose this discrepancy into ``components'' attributable to sampling variability (including publication bias), observable distribution shifts, and residual factors. We compute this decomposition for several directly replicated behavioral science experiments and find little evidence that observable distribution shifts contribute appreciably to non-replicability. In some cases, this is because there is too much statistical noise. In other cases, there is strong evidence that controlling for additional moderators is necessary for reliable replication.