Comparing three learn-then-test paradigms in a multivariate normal means problem

Many modern procedures use the data to learn a structure and then leverage it to test many hypotheses. If the entire dataset is used at both stages, analytical or computational corrections for selection bias are required to ensure validity (post-learning adjustment). Alternatively, one can learn and/or test on masked versions of the data to avoid selection bias, via either information splitting or null augmentation. Choosing among these three learn-then-test paradigms, and deciding how much masking to employ for the latter two, are critical decisions that affect power and currently lack theoretical guidance. In a multivariate normal means model, we derive asymptotic power formulas for prototypical methods from each paradigm (variants of sample splitting, conformal-style null augmentation, and resampling-based post-learning adjustment), quantifying the power losses incurred by masking at each stage. For these paradigm representatives, we find that post-learning adjustment is most powerful, followed by null augmentation, and then information splitting. Moreover, null augmentation can be nearly as powerful as post-learning adjustment while avoiding its challenges: the power of the former approaches that of the latter if the number of nulls used for augmentation is a vanishing fraction of the number of hypotheses. We also prove, for a tractable proxy, that the optimal number of nulls scales as the square root of the number of hypotheses, challenging existing heuristics. Finally, we characterize optimal tuning for information splitting by identifying an optimal split fraction and tying it to the difficulty of the learning problem. These results establish a theoretical foundation for key decisions in the deployment of learn-then-test methods.
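To make the information-splitting paradigm concrete, the following is a minimal sketch and not one of the procedures analyzed in the paper. It assumes unit-variance Gaussian observations and uses the standard Gaussian fission identity: for independent noise z ~ N(0, 1), the copies x + tau*z and x - z/tau are independent. The names `fission` and `learn_then_test`, the masking parameter `tau`, the selection `threshold`, and the Bonferroni correction at the testing stage are all illustrative assumptions.

```python
import math
import random


def fission(x, tau, rng):
    """Split each x_i ~ N(mu_i, 1) into two independent Gaussian copies.

    learn_i = x_i + tau * z_i  has variance 1 + tau^2
    test_i  = x_i - z_i / tau  has variance 1 + 1 / tau^2
    Their covariance is 1 - 1 = 0, so the copies are independent.
    """
    learn, test = [], []
    for xi in x:
        z = rng.gauss(0.0, 1.0)
        learn.append(xi + tau * z)
        test.append(xi - z / tau)
    return learn, test


def learn_then_test(x, tau, threshold, alpha, rng):
    """Learn a candidate set on one copy, then test it on the other."""
    learn, test = fission(x, tau, rng)

    # Learning stage: keep coordinates whose learning copy exceeds the threshold.
    selected = [i for i, v in enumerate(learn) if v > threshold]

    # Testing stage: one-sided z-test on the independent copy,
    # Bonferroni-corrected over the selected set only (an assumed choice).
    sd = math.sqrt(1.0 + 1.0 / tau**2)
    rejected = []
    for i in selected:
        p_value = 0.5 * math.erfc(test[i] / (sd * math.sqrt(2.0)))
        if p_value <= alpha / len(selected):
            rejected.append(i)
    return selected, rejected


if __name__ == "__main__":
    rng = random.Random(0)
    mu = [3.0] * 10 + [0.0] * 90  # hypothetical sparse mean vector
    x = [m + rng.gauss(0.0, 1.0) for m in mu]
    selected, rejected = learn_then_test(x, tau=1.0, threshold=1.5, alpha=0.1, rng=rng)
    print(len(selected), "selected;", len(rejected), "rejected")
```

In this sketch, a larger `tau` masks the learning copy more heavily and leaves more information for testing, while a smaller `tau` does the opposite. This is the kind of trade-off that an optimal split fraction is meant to balance.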
