Testing has developed into the fundamental statistical framework for falsifying hypotheses. Unfortunately, tests are binary in nature: a test either rejects a hypothesis or it does not. Such binary decisions do not reflect the reality of many scientific studies, which often aim to present the evidence against a hypothesis and do not necessarily intend to establish a definitive conclusion. We propose a continuous generalization of a test, which we use to continuously measure the evidence against a hypothesis. Such a continuous test can be viewed as a continuous and non-randomized interpretation of the classical `randomized test'. This offers the benefits of a randomized test without the downsides of external randomization. Another interpretation is as a literal measure, which measures the amount of binary tests that reject the hypothesis. Our work unifies classical testing and the recently proposed $e$-values: $e$-values bounded to $[0, 1/\alpha]$ are continuously interpreted size-$\alpha$ randomized tests. Choosing $\alpha = 0$ yields the regular $e$-value, which we use to define a level-0 continuous test. Moreover, we generalize the traditional notion of power by using generalized means. This produces a framework that contains both classical Neyman-Pearson optimal testing and log-optimal $e$-values, as well as a continuum of other options. The traditional $p$-value appears as the reciprocal of a generally invalid continuous test. In an illustration based on a Gaussian location model, we find that optimal continuous tests are of a beautifully simple form.
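The correspondence between bounded $e$-values and size-$\alpha$ randomized tests can be made concrete with a small numerical sketch. It is an illustration under stated assumptions, not the construction used in the paper: the null is taken to be $N(0, 1)$, the alternative $N(\mu, 1)$, the rescaling between evidence in $[0, 1/\alpha]$ and a rejection probability in $[0, 1]$ is assumed to be multiplication by $\alpha$, and all function names are hypothetical. The sketch checks that both a likelihood-ratio $e$-value and a rescaled one-sided $z$-test have null expectation close to 1.

```python
import math
from statistics import NormalDist

# Null distribution of the assumed Gaussian location model: X ~ N(0, 1).
STD_NORMAL = NormalDist(0.0, 1.0)


def lr_e_value(x, mu):
    """Likelihood ratio of N(mu, 1) against N(0, 1) at observation x."""
    return math.exp(mu * x - 0.5 * mu * mu)


def np_continuous_test(x, alpha):
    """One-sided level-alpha z-test, rescaled from {0, 1} to {0, 1/alpha}."""
    critical = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 / alpha if x > critical else 0.0


def as_rejection_probability(evidence, alpha):
    """Assumed rescaling: evidence in [0, 1/alpha] maps to alpha * evidence.

    The result is a probability only when evidence does not exceed 1/alpha.
    """
    return alpha * evidence


def null_expectation(evidence_fn, lo=-10.0, hi=10.0, steps=20000):
    """Trapezoidal approximation of E[evidence_fn(X)] for X ~ N(0, 1)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * evidence_fn(x) * STD_NORMAL.pdf(x)
    return total * h


if __name__ == "__main__":
    alpha, mu = 0.05, 1.0
    print(null_expectation(lambda x: lr_e_value(x, mu)))
    print(null_expectation(lambda x: np_continuous_test(x, alpha)))
```

Both printed values should be close to 1, up to discretization error. The rescaled $z$-test takes only the values 0 and $1/\alpha$, whereas the likelihood ratio varies continuously with the observation, which is the sense in which it reports graded evidence rather than a binary decision.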