Simultaneously performing variable selection and inference in high-dimensional regression models is an open challenge in statistics and machine learning. As datasets with very large numbers of variables become increasingly available, dedicated statistical procedures are needed to accurately select the most important predictors in a high-dimensional space while controlling the False Discovery Rate (FDR) arising from the underlying multiple hypothesis testing. In this paper we propose combining the Mirror Statistic approach to FDR control with outcome randomisation, in order to maximise the statistical power of the variable selection procedure. Through extensive simulations we show that the proposed strategy combines the benefits of the two techniques. The Mirror Statistic is a flexible method for controlling the FDR that requires only mild model assumptions, but it needs two independent sets of regression coefficient estimates, usually obtained by splitting the original dataset. Outcome randomisation is an alternative to Data Splitting that generates two independent outcomes, which can then be used to estimate the coefficients that enter the Mirror Statistic. Combining the two approaches increases testing power in a number of scenarios, such as highly correlated covariates and a high percentage of active variables. Moreover, the method scales to very high-dimensional problems, since the algorithm has a low memory footprint and requires only a single run on the full dataset, as opposed to iterative alternatives such as Multiple Data Splitting.
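The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on assumptions that the abstract does not state: Gaussian noise with a known standard deviation `sigma`, an additive randomisation of the form `y + w` and `y - w`, ordinary least squares in place of the penalised estimator that a genuinely high-dimensional problem would need, and the sign-times-sum form of the Mirror Statistic, which is one common choice. All function names are hypothetical.

```python
import numpy as np


def randomise_outcome(y, sigma, rng):
    """Create two outcomes from y by adding and subtracting Gaussian noise.

    Assumption: y ~ N(mu, sigma^2 I) with known sigma. Then y + w and y - w,
    with w ~ N(0, sigma^2 I) drawn independently of y, are uncorrelated and
    jointly Gaussian, hence independent.
    """
    w = rng.normal(0.0, sigma, size=y.shape)
    return y + w, y - w


def mirror_statistic(b1, b2):
    """Sign-times-sum mirror statistic (one common choice, assumed here).

    Large positive values arise when both estimates agree in sign and are
    large; for null variables the statistic is roughly symmetric around zero.
    """
    return np.sign(b1 * b2) * (np.abs(b1) + np.abs(b2))


def fdr_threshold(m, q):
    """Smallest cutoff t whose estimated false discovery proportion is <= q.

    The count of statistics below -t estimates the number of false
    discoveries among the statistics above t.
    """
    candidates = np.concatenate(([0.0], np.sort(np.abs(m[m != 0]))))
    for t in candidates:
        fdp_hat = np.sum(m < -t) / max(np.sum(m > t), 1)
        if fdp_hat <= q:
            return t
    return np.inf


def select(X, y, sigma, q=0.1, seed=0):
    """Run randomisation, two coefficient fits, and mirror-statistic selection."""
    rng = np.random.default_rng(seed)
    y1, y2 = randomise_outcome(y, sigma, rng)
    # OLS keeps the sketch dependency-free; it requires n > p.
    b1 = np.linalg.lstsq(X, y1, rcond=None)[0]
    b2 = np.linalg.lstsq(X, y2, rcond=None)[0]
    m = mirror_statistic(b1, b2)
    tau = fdr_threshold(m, q)
    return np.flatnonzero(m > tau), m


if __name__ == "__main__":
    # Simulated data, for illustration only: 5 active variables out of 20.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 20))
    beta = np.zeros(20)
    beta[:5] = 3.0
    y = X @ beta + rng.normal(size=500)
    selected, m = select(X, y, sigma=1.0, q=0.1)
    print("selected variables:", selected)
```

Both fits use all rows of `X`, which is the contrast with Data Splitting that the abstract draws: the independence of the two coefficient estimates comes from the randomised outcomes, not from disjoint subsets of observations.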