On the universal calibration of heavy-tailed combination tests

It is often of interest to test a global null hypothesis using multiple, possibly dependent $p$-values by combining their strengths while controlling the type-I error. Recently, several heavy-tailed combination tests, such as the harmonic mean test and the Cauchy combination test, have been proposed: they transform $p$-values into heavy-tailed random variables before combining them into a single test statistic. The resulting tests, which are calibrated under some form of independence assumption among the $p$-values, have been shown to be fairly robust to dependence asymptotically as the significance level $\alpha$ tends to zero. Yet it has remained an open problem to understand this general phenomenon and to characterize how such tests behave under dependence. Using the framework of multivariate regular variation from extreme value theory, we show that, for a class of homogeneous combination tests, the asymptotic level of the test can be expressed in terms of the angular measure. This measure characterizes the dependence of the transformed heavy-tailed variables in their upper tails or, equivalently, the dependence of the $p$-values near zero. We use this result to study several tests. The harmonic mean test, which coincides with the Pareto linear combination test, is shown to be universally calibrated regardless of the tail dependence; further, it is shown to be the only test that achieves universal calibration among all homogeneous heavy-tailed combination tests. In contrast, the Cauchy combination test is shown to be universally honest but often conservative; the Dunn-Šidák correction, also known as Tippett's method, is honest but calibrated if and only if the underlying $p$-values are independent near zero. These theoretical findings are corroborated by simulations and an application to independence testing with survey data.
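As a rough illustration of the three combination rules named above, the following Python sketch computes a combined $p$-value for each, using equal weights. The function names are hypothetical, and the formulas are the standard textbook forms (the uncorrected harmonic mean $p$-value, the equal-weight Cauchy combination, and the Šidák-adjusted minimum), which are assumptions here and may differ in detail from the exact calibration used in the paper.

```python
import math


def harmonic_mean_p(pvals):
    """Uncorrected harmonic mean p-value: n / sum(1 / p_i).

    Equivalent to averaging the Pareto-type variables 1 / p_i and
    inverting; only asymptotically calibrated as the level goes to 0.
    """
    n = len(pvals)
    return n / sum(1.0 / p for p in pvals)


def cauchy_combination_p(pvals):
    """Equal-weight Cauchy combination p-value.

    Each p_i is mapped to tan((0.5 - p_i) * pi), which is standard
    Cauchy when p_i is uniform; the average is referred to the
    standard Cauchy upper tail.
    """
    n = len(pvals)
    t = sum(math.tan((0.5 - p) * math.pi) for p in pvals) / n
    return 0.5 - math.atan(t) / math.pi


def sidak_p(pvals):
    """Dunn-Sidak (Tippett) combined p-value: 1 - (1 - min p_i)^n."""
    n = len(pvals)
    return 1.0 - (1.0 - min(pvals)) ** n


if __name__ == "__main__":
    # Hypothetical input: five identical p-values (perfect dependence).
    pvals = [0.01] * 5
    print("harmonic mean:", harmonic_mean_p(pvals))
    print("Cauchy       :", cauchy_combination_p(pvals))
    print("Dunn-Sidak   :", sidak_p(pvals))
```

With identical inputs, the harmonic mean and Cauchy rules return the common $p$-value, whereas the Dunn-Šidák rule returns roughly $n$ times that value when it is small, which gives a small-scale picture of how the rules can differ under strong dependence near zero.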
