This article investigates uncertainty quantification for the generalized linear lasso~(GLL), a popular variable selection method in high-dimensional regression. In many fields, researchers use data-driven methods to select a subset of variables that are most likely to be associated with a response variable. However, reusing the same data for inference after such a selection step can introduce bias and increase the likelihood of false positives, leading to incorrect conclusions. We propose a post-selection inference~(PSI) framework, SIGLE, that addresses these issues and allows valid statistical inference after variable selection with the GLL. We show that our method provides accurate $p$-values and confidence intervals while maintaining high statistical power. In a second stage, we focus on sparse logistic regression, a popular classifier in high-dimensional statistics. Extensive numerical simulations show that SIGLE is more powerful than state-of-the-art PSI methods. SIGLE relies on a new method for sampling states from the distribution of the observations conditional on the selection event. This method is based on a simulated annealing strategy whose energy is given by the first-order conditions of the logistic lasso.
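For concreteness, the first-order conditions of the logistic lasso can be written as follows. This is a sketch under the standard $\ell_1$-penalized logistic likelihood; the notation (design matrix $X \in \mathbb{R}^{n \times p}$ with rows $x_i^\top$, responses $y \in \{0,1\}^n$, regularization parameter $\lambda > 0$, sigmoid $\sigma$ applied entrywise) is our assumption and is not fixed by the abstract.
\[
\hat\theta \in \arg\min_{\theta \in \mathbb{R}^p} \; \sum_{i=1}^{n} \Big[ \log\big(1 + e^{x_i^\top \theta}\big) - y_i \, x_i^\top \theta \Big] + \lambda \|\theta\|_1 ,
\qquad
\sigma(t) = \frac{1}{1 + e^{-t}} ,
\]
\[
X^\top \big( y - \sigma(X \hat\theta) \big) = \lambda \hat s ,
\qquad
\hat s_j = \operatorname{sign}(\hat\theta_j) \ \text{if } \hat\theta_j \neq 0 ,
\qquad
\hat s_j \in [-1, 1] \ \text{if } \hat\theta_j = 0 .
\]
Under these assumptions, a response vector $y$ leads to a given selected support exactly when these conditions admit a solution $\hat\theta$ with that support, so an energy measuring the violation of the conditions is a natural target for annealing. The precise energy used by SIGLE is not specified here.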