StarTrek: Combinatorial Variable Selection with False Discovery Rate Control

Variable selection on large-scale networks has been extensively studied in the literature. While most existing methods are limited to local functionals, especially graph edges, this paper focuses on selecting the discrete hub structures of networks. Specifically, we propose an inferential method, called the StarTrek filter, to select hub nodes whose degrees exceed a given threshold in high-dimensional graphical models while controlling the false discovery rate (FDR). Discovering hub nodes in networks is challenging for two reasons: the combinatorial structure means there is no straightforward statistic for testing the degree of a node, and the complicated dependence in the multiple testing problem is hard to characterize and control. In methodology, the StarTrek filter overcomes these challenges by constructing p-values based on maximum test statistics via the Gaussian multiplier bootstrap. In theory, we show that the StarTrek filter controls the FDR by providing accurate bounds on the approximation errors of the quantile estimation and by addressing the dependence structures among the maximal statistics. To this end, we establish novel Cram\'er-type comparison bounds for high-dimensional Gaussian random vectors. Compared with the Gaussian comparison bound in Kolmogorov distance established by \citet{chernozhukov2014anti}, our Cram\'er-type comparison bounds bound the relative difference between the distribution functions of two high-dimensional Gaussian random vectors. We illustrate the validity of the StarTrek filter in a series of numerical experiments and apply it to the genotype-tissue expression dataset to discover central regulator genes.
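To make the two ingredients named above more concrete, the following is a minimal, generic sketch of (i) a p-value for a maximum statistic obtained by the Gaussian multiplier bootstrap and (ii) a step-up selection rule applied to a set of such p-values. It is not the StarTrek filter itself: the function names, the `scores` input (assumed to be per-sample contributions with mean zero under the null), and the use of the standard Benjamini-Hochberg rule for the selection step are all illustrative assumptions, and the combinatorial degree test described in the abstract is not reproduced here.

```python
import numpy as np


def multiplier_bootstrap_pvalue(scores, n_boot=2000, rng=None):
    """P-value for T = max_j |n^{-1/2} sum_i scores[i, j]|.

    `scores` is a hypothetical (n, m) array of per-sample contributions,
    assumed to have mean zero in every column under the null.
    """
    rng = np.random.default_rng(rng)
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]

    # Observed maximum statistic.
    t_obs = np.abs(scores.sum(axis=0)).max() / np.sqrt(n)

    # Bootstrap replicates: reweight the centered rows with i.i.d. N(0, 1)
    # multipliers and recompute the maximum each time.
    centered = scores - scores.mean(axis=0)
    xi = rng.standard_normal((n_boot, n))
    t_boot = np.abs(xi @ centered).max(axis=1) / np.sqrt(n)

    # The +1 correction keeps the p-value strictly positive.
    return (1 + np.sum(t_boot >= t_obs)) / (1 + n_boot)


def benjamini_hochberg(pvals, q=0.1):
    """Indices selected by the standard Benjamini-Hochberg step-up rule."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    if not passed.any():
        return np.array([], dtype=int)
    k = np.nonzero(passed)[0].max()  # largest rank meeting its threshold
    return np.sort(order[: k + 1])
```

In a hub-selection setting one would compute one such p-value per candidate node and pass the resulting vector to the selection rule. The FDR guarantee described in the abstract depends on the paper's own analysis of the dependence among the maximal statistics, which this sketch does not attempt to capture.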