We propose the Terminating-Random Experiments (T-Rex) selector, a fast variable selection method for high-dimensional data. The T-Rex selector controls a user-defined target false discovery rate (FDR) while maximizing the number of selected variables. This is achieved by fusing the solutions of multiple early-terminated random experiments. The experiments are conducted on a combination of the original predictors and multiple sets of randomly generated dummy predictors. A finite-sample proof of the FDR control property, based on martingale theory, is provided. We also prove that the dummies can be sampled from any univariate probability distribution with finite expectation and variance. Numerical simulations confirm that the FDR is controlled at the target level while allowing for high power. The computational complexity of the proposed method is linear in the number of variables. The T-Rex selector outperforms state-of-the-art methods for FDR control in numerical experiments and on a simulated genome-wide association study (GWAS), while its sequential computation time is more than two orders of magnitude lower than that of the strongest benchmark methods. The open-source R package TRexSelector, which contains the implementation of the T-Rex selector, is available on CRAN.
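To make the idea of fusing early-terminated random experiments concrete, the following is a minimal Python sketch written under several assumptions; it is not the TRexSelector implementation. Each experiment appends a fresh set of standard normal dummies to the predictors, runs a greedy forward selection (an orthogonal-matching-pursuit-style stand-in for whatever forward selection method the actual selector uses), and stops as soon as `T` dummies have been included. The fraction of experiments in which each original variable is selected is then compared against a voting threshold. The function names, the defaults `K`, `L`, `T`, and the threshold of 0.75 are all hypothetical. In particular, the sketch fixes these parameters by hand instead of calibrating them, so it does not by itself provide FDR control.

```python
import numpy as np


def _standardize(A):
    """Center each column and scale it to unit Euclidean norm."""
    A = A - A.mean(axis=0)
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0
    return A / norms


def early_terminated_experiment(X_aug, y, p, T):
    """Greedy forward selection that stops once T dummies are included.

    Columns 0..p-1 of X_aug are original predictors, the rest are dummies.
    Returns the original-predictor indices selected before termination.
    (Assumption: a simple greedy rule is used here for illustration.)
    """
    n, m = X_aug.shape
    available = np.ones(m, dtype=bool)
    active = []
    residual = y.copy()
    dummies_in = 0
    max_steps = min(n - 1, m)
    while dummies_in < T and len(active) < max_steps:
        # Pick the not-yet-selected column most correlated with the residual.
        corr = np.abs(X_aug.T @ residual)
        corr[~available] = -np.inf
        j = int(np.argmax(corr))
        active.append(j)
        available[j] = False
        if j >= p:
            dummies_in += 1
        # Refit on the active set and update the residual.
        coef, *_ = np.linalg.lstsq(X_aug[:, active], y, rcond=None)
        residual = y - X_aug[:, active] @ coef
    return [j for j in active if j < p]


def relative_occurrences(X, y, K=20, L=None, T=1, seed=0):
    """Fraction of K random experiments in which each variable is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    L = p if L is None else L  # hypothetical default: as many dummies as predictors
    y_centered = y - y.mean()
    counts = np.zeros(p)
    for _ in range(K):
        dummies = rng.standard_normal((n, L))  # fresh dummies per experiment
        X_aug = _standardize(np.hstack([X, dummies]))
        selected = early_terminated_experiment(X_aug, y_centered, p, T)
        counts[np.asarray(selected, dtype=int)] += 1
    return counts / K


# Toy data: 3 strongly active variables out of 20.
rng = np.random.default_rng(1)
n, p = 200, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 5.0
y = X @ beta + rng.standard_normal(n)

phi = relative_occurrences(X, y, K=20, T=1)
selected = np.flatnonzero(phi > 0.75)  # hypothetical fixed voting threshold
print(selected)
```

Stopping each experiment after only a few dummies have entered is what keeps the cost low: the selection path is never computed to completion, and the experiments are independent of each other, so they could also be run in parallel.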