P-hacking is prevalent in reality but absent from classical hypothesis-testing theory. As a consequence, significant results are much more common than they are supposed to be when the null hypothesis is in fact true. In this paper, we build a model of hypothesis testing with p-hacking. From the model, we construct critical values such that, if the values are used to determine significance, and if scientists' p-hacking behavior adjusts to the new significance standards, significant results occur with the desired frequency. Because such robust critical values allow for p-hacking, they are larger than classical critical values. To illustrate the amount of correction that p-hacking might require, we calibrate the model using evidence from the medical sciences. In the calibrated model, the robust critical value for any test statistic is the classical critical value for the same test statistic at one fifth of the significance level.
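As a rough illustration of the calibrated rule stated in the abstract, the sketch below computes a classical critical value and the corresponding robust critical value, obtained by evaluating the classical one at one fifth of the significance level. It assumes a standard normal (z) test statistic purely for concreteness; the function names `classical_critical_value` and `robust_critical_value` and the `shrink` parameter are hypothetical and do not come from the paper. The factor of 5 is the calibrated value reported in the abstract, not a general constant.

```python
from statistics import NormalDist


def classical_critical_value(alpha: float, two_sided: bool = True) -> float:
    """Classical critical value of a z-test at significance level alpha."""
    # For a two-sided test, alpha is split evenly between the two tails.
    tail = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1 - tail)


def robust_critical_value(alpha: float, two_sided: bool = True,
                          shrink: float = 5.0) -> float:
    """Robust critical value under the calibrated rule from the abstract.

    Assumption: the robust value is the classical critical value evaluated
    at alpha / shrink, with shrink = 5 as reported for the calibrated model.
    """
    return classical_critical_value(alpha / shrink, two_sided)


if __name__ == "__main__":
    alpha = 0.05
    print(f"classical z critical value at {alpha}: "
          f"{classical_critical_value(alpha):.3f}")
    print(f"robust z critical value at {alpha}:    "
          f"{robust_critical_value(alpha):.3f}")
```

For a two-sided z-test at the 5% level, this gives a classical critical value of about 1.96 and a robust critical value of about 2.58, which is the classical critical value at the 1% level. The same rule would apply to other test statistics by swapping in the appropriate quantile function.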