We propose a test of the significance of a variable appearing on the Lasso path and use it in a procedure for selecting one of the models of the Lasso path while controlling the family-wise error rate. Our null hypothesis depends on a set A of already selected variables and states that A contains all the active variables. We focus on the value of the regularization parameter at which the first variable outside A is selected. As the test statistic, we use the conditional p-value of this quantity, defined conditionally on the non-penalized estimated coefficients of the model restricted to A. We estimate this p-value by simulating outcome vectors and then calibrating them on the coefficients estimated from the observed outcome. We adapt the calibration heuristically to generalized linear models, where it becomes an iterative stochastic procedure. We prove that, in linear models, the test controls the risk of selecting a false positive both under the null hypothesis and, under a correlation condition, when A does not contain all the active variables. We assess the performance of our procedure through extensive simulation studies. We also illustrate it by detecting exposures associated with drug-induced liver injuries in the French pharmacovigilance database.
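To make the simulate-and-calibrate idea concrete for the Gaussian linear model, the following is a minimal sketch and not the authors' implementation. The function names `entry_lambda` and `conditional_p_value` are hypothetical. It assumes the objective (1/2n)||y - Xb||^2 + lambda * sum of |b_j| over j outside A, so that the first variable outside A enters at lambda = max_j |x_j' r| / n, where r is the least-squares residual of y on the columns in A. It further assumes, beyond what the abstract states, that each simulated outcome is calibrated by projecting out the columns in A and rescaling to the observed residual norm, so that its least-squares fit on A matches the observed one. A is assumed non-empty (for example, it holds an intercept column).

```python
import numpy as np

def entry_lambda(X, y, A):
    """Lambda at which the first variable outside A enters the Lasso path
    when the coefficients of the columns in A are left unpenalized."""
    n, p = X.shape
    XA = X[:, A]
    beta_A, *_ = np.linalg.lstsq(XA, y, rcond=None)  # unpenalized fit on A
    resid = y - XA @ beta_A
    outside = np.setdiff1d(np.arange(p), A)
    # KKT condition: variable j stays at zero while |x_j' resid| / n <= lambda
    return np.max(np.abs(X[:, outside].T @ resid)) / n

def conditional_p_value(X, y, A, n_sim=500, seed=0):
    """Monte Carlo estimate of the p-value of entry_lambda, conditional on
    the least-squares coefficients on A (illustrative calibration only)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    XA = X[:, A]
    beta_A, *_ = np.linalg.lstsq(XA, y, rcond=None)
    fitted = XA @ beta_A
    resid_norm = np.linalg.norm(y - fitted)
    lam_obs = entry_lambda(X, y, A)
    Q, _ = np.linalg.qr(XA)  # orthonormal basis of the column space of XA
    exceed = 0
    for _ in range(n_sim):
        eps = rng.standard_normal(n)
        eps -= Q @ (Q.T @ eps)                   # remove the part explained by A
        eps *= resid_norm / np.linalg.norm(eps)  # assumption: match residual norm
        y_sim = fitted + eps                     # same fitted coefficients on A
        if entry_lambda(X, y_sim, A) >= lam_obs:
            exceed += 1
    # add-one correction keeps the estimate strictly positive
    return (1 + exceed) / (1 + n_sim)
```

A small p-value suggests that some variable outside A is active, so the selection procedure would continue along the path; a large one suggests stopping at the model given by A. The generalized linear model case described above needs the iterative stochastic calibration and is not covered by this sketch.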