How to Tell When a Result Will Replicate: Significance and Replication in Distributional Null Hypothesis Tests

There is a well-known problem in Null Hypothesis Significance Testing: many statistically significant results fail to replicate in subsequent experiments. We show that this problem arises because standard 'point-form null' significance tests consider only within-experiment variation and ignore between-experiment variation, and so systematically underestimate the degree of random variation in results. We give an extension to standard significance testing that addresses this problem by analysing both within- and between-experiment variation. This 'distributional null' approach does not underestimate experimental variability and so is not overconfident in identifying significance; because it models between-experiment variation, it also gives mathematically coherent estimates of the probability that a significant result will replicate. Using a large-scale replication dataset (the first 'Many Labs' project), we show that many experimental results that appear statistically significant in standard tests are in fact consistent with random variation once both within- and between-experiment variation are taken into account. Further, grouping the experiments in this dataset into 'predictor-target' pairs, we show that the replication probabilities this approach predicts for target experiments (given the predictor experiment's result and the sample sizes of the two experiments) are strongly correlated with observed replication rates. Distributional null hypothesis testing thus gives researchers a statistical tool for identifying results that are both statistically significant and reliably replicable.
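To make the contrast between the two kinds of null concrete, the following is a minimal simulation sketch and not the procedure used in the paper. It assumes a simple normal model that is not taken from the text: each experiment's null effect is drawn from a normal distribution with mean 0 and standard deviation `tau` (between-experiment variation), and the observed sample mean then varies around that effect with standard error `sigma / sqrt(n)` (within-experiment variation). The function names, the parameter values, and the choice of a z-test are all illustrative assumptions.

```python
import math
import random
from statistics import NormalDist


def z_point_null(mean, sigma, n):
    """z statistic when the null effect is assumed to be exactly 0 in every experiment."""
    return mean / (sigma / math.sqrt(n))


def z_distributional_null(mean, sigma, n, tau):
    """z statistic when null effects are assumed to vary across experiments with SD tau."""
    return mean / math.sqrt(sigma ** 2 / n + tau ** 2)


def false_positive_rates(n_experiments=20000, n=100, sigma=1.0, tau=0.1,
                         alpha=0.05, seed=0):
    """Simulate experiments with no systematic effect.

    Returns the fraction of experiments declared significant by the
    point-form test and by the distributional test, in that order.
    """
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    point_hits = 0
    dist_hits = 0
    for _ in range(n_experiments):
        # Between-experiment variation: this experiment's own null effect (assumed normal).
        mu = rng.gauss(0.0, tau)
        # Within-experiment variation: the mean of n observations with SD sigma
        # is normal around mu with standard error sigma / sqrt(n).
        mean = rng.gauss(mu, sigma / math.sqrt(n))
        if abs(z_point_null(mean, sigma, n)) > crit:
            point_hits += 1
        if abs(z_distributional_null(mean, sigma, n, tau)) > crit:
            dist_hits += 1
    return point_hits / n_experiments, dist_hits / n_experiments


if __name__ == "__main__":
    point_rate, dist_rate = false_positive_rates()
    print(f"point-form null rejection rate:     {point_rate:.3f}")
    print(f"distributional null rejection rate: {dist_rate:.3f}")
```

With these assumed values the within-experiment standard error and `tau` are both 0.1, so the point-form statistic has a standard deviation of about 1.41 rather than 1. Its expected rejection rate is therefore roughly 17% at a nominal 5% level, while the statistic that includes both variance components stays near 5%. In practice `tau` would have to be estimated from data, which this sketch does not attempt.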
