Inference with Sequential Monte-Carlo Computation of $p$-values: Fast and Valid Approaches

Hypothesis tests calibrated by (re)sampling methods, such as permutation, rank and bootstrap tests, are useful tools for statistical analysis, but they carry the computational cost of Monte-Carlo sampling for calibration. It is common, and almost universal, practice to run such tests with a large, predetermined number of Monte-Carlo samples and to disregard the randomness of this sampling when drawing and reporting inference. At best, this approach leads to computational inefficiency; at worst, to invalid inference. A number of approaches have been proposed in the literature to guide analysts adaptively in choosing the number of Monte-Carlo samples, by sequentially deciding when to stop collecting samples and draw inference. These works introduce various competing notions of what constitutes "valid" inference, complicating the landscape for analysts seeking suitable methodology. Furthermore, the majority of these approaches guarantee only a meaningful estimate of the testing outcome, not of the $p$-value itself, which is insufficient for many practical applications. In this paper, we survey the relevant literature and build bridges between the scattered validity notions, highlighting some of their complementary roles. We also introduce a new practical methodology that provides an estimate of the $p$-value of the Monte-Carlo test, endowed with practically relevant validity guarantees. Moreover, our methodology is sequential: it updates the $p$-value estimate after each new Monte-Carlo sample has been drawn, while retaining important validity guarantees regardless of the selected stopping time. We conclude with a set of recommendations for the practitioner, covering both the selection of methodology and the manner of reporting results.
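To make the setting concrete, the following is a minimal Python sketch of the two kinds of procedure the abstract contrasts. It is not the methodology introduced in this paper. The first function is the standard fixed-budget permutation $p$-value with the usual add-one correction; the second is an early-stopping rule in the style of Besag and Clifford (1991), used here only as one example of a sequential approach from the literature. The test statistic (absolute difference in means), the function names, and the parameters `n_samples`, `max_samples` and `h` are all assumptions chosen for illustration.

```python
import random


def mean_diff(x, y):
    """Difference in sample means; the illustrative test statistic."""
    return sum(x) / len(x) - sum(y) / len(y)


def permutation_p_value(x, y, n_samples=999, rng=None):
    """Fixed-budget Monte-Carlo permutation p-value with add-one correction."""
    rng = rng or random.Random()
    observed = abs(mean_diff(x, y))
    pooled = list(x) + list(y)
    exceed = 0
    for _ in range(n_samples):
        rng.shuffle(pooled)  # random relabelling of the pooled data
        stat = abs(mean_diff(pooled[:len(x)], pooled[len(x):]))
        if stat >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_samples)


def sequential_p_value(x, y, max_samples=999, h=10, rng=None):
    """Besag-Clifford-style rule: stop once h exceedances have been seen.

    Returns the p-value and the number of Monte-Carlo samples drawn.
    """
    rng = rng or random.Random()
    observed = abs(mean_diff(x, y))
    pooled = list(x) + list(y)
    exceed = 0
    for drawn in range(1, max_samples + 1):
        rng.shuffle(pooled)
        stat = abs(mean_diff(pooled[:len(x)], pooled[len(x):]))
        if stat >= observed:
            exceed += 1
            if exceed == h:
                # Early stop: evidence against the null is clearly weak.
                return h / drawn, drawn
    # Budget exhausted: fall back to the add-one corrected estimate.
    return (1 + exceed) / (1 + max_samples), max_samples
```

Calling `permutation_p_value` on the same data with different seeds generally returns slightly different values, which is the sampling randomness that the abstract notes is usually ignored at reporting time. The early-stopping variant spends few samples when the observed statistic is unremarkable, and the full budget only when the $p$-value is small.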
