Large-scale adaptive multiple testing for sequential data controlling false discovery and nondiscovery rates

Modern scientific experiments frequently produce high-dimensional data, and in some experiments these data arrive sequentially rather than all at once. We develop multiple testing procedures with simultaneous control of the false discovery and false nondiscovery rates when $m$-variate data vectors $\mathbf{X}_1, \mathbf{X}_2, \dots$ are observed sequentially or in groups and each coordinate of these vectors gives rise to a hypothesis test. Existing multiple testing methods for sequential data use fixed stopping boundaries that do not depend on the sample size and are therefore quite conservative when the number of hypotheses $m$ is large. We propose sequential tests based on adaptive stopping boundaries that ensure the continue-sampling region shrinks as the sample size increases. Under minimal assumptions on the data sequence, we first develop a test based on an oracle test statistic such that both the false discovery rate (FDR) and the false nondiscovery rate (FNR) are strongly controlled and nearly equal to prespecified levels. Under a two-group mixture model assumption, we propose a data-driven stopping and decision rule based on the local false discovery rate statistic that mimics the oracle rule and guarantees simultaneous control of FDR and FNR asymptotically as $m$ tends to infinity. Both the oracle and the data-driven stopping times are shown to be finite (i.e., proper) with probability 1 for all finite $m$ and to converge to a finite constant as $m$ grows to infinity. Further, we compare the data-driven test with the existing gap rule proposed in He and Bartroff (2021) and show that the ratio of the expected sample size of our method to that of the gap rule tends to zero as $m$ goes to infinity. Extensive analysis of simulated datasets and of some real datasets illustrates the superiority of the proposed tests over some existing methods.
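To make the idea of a stopping and decision rule based on the local false discovery rate concrete, the following Python sketch works through a toy version. It is an illustration under stated assumptions, not the procedure developed in the paper: it assumes a Gaussian two-group model in which each coordinate is null ($N(0,1)$) or non-null ($N(\mu,1)$) with known non-null proportion $p$ and known $\mu$, and it uses a simple step-up rule on the sorted local FDR values together with an estimated-FNR stopping check. The function names `local_fdr`, `step_decision`, and `sequential_test` are hypothetical.

```python
import math


def local_fdr(cum_sum, n, p, mu):
    """Posterior null probability of one coordinate after n observations.

    Assumes (for illustration only) N(0,1) under the null, N(mu,1) under the
    alternative, and a known non-null proportion p.
    """
    # Log posterior odds of non-null vs. null given the cumulative sum.
    log_odds = math.log(p / (1.0 - p)) + mu * cum_sum - n * mu ** 2 / 2.0
    # Numerically stable evaluation of 1 / (1 + exp(log_odds)).
    if log_odds > 0:
        e = math.exp(-log_odds)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(log_odds))


def step_decision(lfdr, alpha, beta):
    """Reject the k smallest lfdr values whose running mean stays <= alpha.

    Returns the rejection indicators and whether the estimated FNR among the
    remaining (accepted) coordinates is <= beta, which is the illustrative
    stopping criterion used here.
    """
    order = sorted(range(len(lfdr)), key=lambda i: lfdr[i])
    running_sum, k = 0.0, 0
    for rank, i in enumerate(order, start=1):
        running_sum += lfdr[i]
        if running_sum / rank > alpha:
            break
        k = rank
    reject = [False] * len(lfdr)
    for i in order[:k]:
        reject[i] = True
    accepted = [1.0 - lfdr[i] for i in order[k:]]
    est_fnr = sum(accepted) / len(accepted) if accepted else 0.0
    return reject, est_fnr <= beta


def sequential_test(stream, p, mu, alpha=0.05, beta=0.05, max_n=1000):
    """Consume m-variate vectors one at a time until the stopping check passes."""
    cum_sums, n, reject = None, 0, []
    for n, x in enumerate(stream, start=1):
        if cum_sums is None:
            cum_sums = list(x)
        else:
            cum_sums = [s + v for s, v in zip(cum_sums, x)]
        lfdr = [local_fdr(s, n, p, mu) for s in cum_sums]
        reject, stop = step_decision(lfdr, alpha, beta)
        if stop or n >= max_n:
            break
    return n, reject
```

As a small usage example, feeding the constant stream `[3.0, -3.0]` with `p=0.5` and `mu=1.0` does not stop after one observation (the smaller local FDR is about 0.076, above `alpha=0.05`), but stops at `n = 2` with the first coordinate rejected and the second accepted. In this toy rule the decision thresholds act on posterior probabilities that sharpen as evidence accumulates, which is one simple way to picture a continue-sampling region that shrinks with the sample size.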
