The Difference Between "Replicable" and "Not Replicable" Is Not Itself Scientifically Replicable

Replication studies estimate the replicability rate of scientific results by aggregating binary verdicts of experiments. Exact replications are rarely attainable, so most replication sequences are non-exact. Experiments differ in ways that matter and do not share a single data-generating process. We formalize two statistical interpretations of non-exactness. In a shared latent rate (benchmark) model, experiments are exchangeable and depend on a common random replicability rate. In a conditionally independent rates (operational) model, each experiment has its own replicability rate drawn from a population distribution. Under the benchmark model, even small variability among replicability rates induces an irreducible variance floor on the estimated mean replicability rate that no amount of replication can eliminate. Under the operational model, the degree of non-exactness is not identifiable from standard replication data, because one binary verdict per experiment carries no information about between-experiment heterogeneity. Researchers cannot tell which precision regime they are in or whether high- and low-replicability sequences can be distinguished in principle. The usual data structure cannot support reliable demarcation between "replicable" and "not replicable" results and systematically understates uncertainty, making high- and low-replicability sequences appear discriminable when they are not. We show how common sources of heterogeneity amplify these problems and demonstrate practical consequences in a reanalysis of Many Labs 4. Aggregating replicability rates across heterogeneous literatures produces averages that conflate incommensurable regimes and lack a stable interpretation. Replicability rate is not a reliable demarcation criterion. The replication crisis, if there is one, cannot be established by the methods used to declare it.
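The two precision regimes described above can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes each binary verdict is a Bernoulli draw given its experiment's replicability rate, and it applies the law of total variance to the observed proportion of successful replications. The function names and the parameters `mean`, `rate_var`, and `n` are hypothetical, introduced only for illustration.

```python
def benchmark_variance(mean, rate_var, n):
    """Variance of the observed replication proportion over n experiments
    when all of them share one latent rate p, with E[p] = mean and
    Var(p) = rate_var (shared latent rate reading; an assumption here)."""
    # Law of total variance: Var(p) + E[p(1 - p)] / n,
    # where E[p(1 - p)] = mean * (1 - mean) - rate_var.
    within = (mean * (1.0 - mean) - rate_var) / n
    return rate_var + within


def operational_variance(mean, rate_var, n):
    """Same quantity when each experiment draws its own rate independently
    from a population with E[p] = mean and Var(p) = rate_var.
    Each verdict is then marginally Bernoulli(mean), so rate_var cancels
    out and the argument is deliberately unused."""
    return mean * (1.0 - mean) / n


# Illustrative values only: mean rate 0.5, rate standard deviation 0.1.
for n in (10, 100, 10_000):
    print(n, benchmark_variance(0.5, 0.01, n), operational_variance(0.5, 0.01, n))
```

Under these assumptions the benchmark variance never falls below `rate_var`, however large `n` becomes, while the operational variance does not depend on `rate_var` at all. The first property corresponds to the variance floor and the second to the non-identifiability of heterogeneity from one binary verdict per experiment.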
