In a high-dimensional regression setting in which the number of variables ($p$) is much larger than the sample size ($n$), the number of possible two-way interactions between the variables is immense. If the number of variables is on the order of one million, as is common in, e.g., genetics, the number of two-way interactions is on the order of one million squared. Testing all pairs for interaction one by one is computationally infeasible, and the required multiple testing correction is severe. In this paper we describe a two-stage testing procedure consisting of a screening stage and an evaluation stage. We prove that, under some assumptions, the test statistics in the two stages are asymptotically independent. As a result, multiplicity correction in the second stage is needed only for the number of statistical tests actually performed in that stage, which increases the power of the testing procedure. Moreover, because the test in the first stage is computationally simple, the computational burden is reduced. Simulations were performed for multiple settings and regression models (generalized linear models and the Cox proportional hazards model) to study the performance of the two-stage testing procedure. The results show type I error control and an increase in power compared with the procedure in which all pairs are tested one by one.
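To give a sense of scale for the multiplicity argument, the following minimal sketch compares per-test significance levels when all pairs are tested with those when only screened pairs are tested. It assumes a Bonferroni correction, which the abstract does not specify; the level `alpha = 0.05` and the count `n_selected` of pairs passing the screening stage are hypothetical values chosen purely for illustration.

```python
from math import comb


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level under a Bonferroni correction (assumed method)."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


p = 1_000_000          # number of variables, matching the genetics example
alpha = 0.05           # family-wise error level (assumed value)
n_pairs = comb(p, 2)   # all unordered pairs: p * (p - 1) / 2
n_selected = 10_000    # hypothetical number of pairs passing the screening stage

exhaustive = bonferroni_threshold(alpha, n_pairs)    # correct for every pair
two_stage = bonferroni_threshold(alpha, n_selected)  # correct for stage two only

print(f"pairs to test exhaustively: {n_pairs:,}")
print(f"per-test level, all pairs:  {exhaustive:.2e}")
print(f"per-test level, stage two:  {two_stage:.2e}")
```

Under these assumed numbers, the per-test level in the second stage is several orders of magnitude less strict than in the exhaustive procedure. Correcting only for the second-stage tests relies on the asymptotic independence of the two stages' test statistics.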