We consider random sample splitting for estimation and inference in high-dimensional generalized linear models: we first apply the lasso to one subsample to select a submodel, and then apply the debiased lasso to the remaining subsample to fit the selected model. We show that, whether or not a prespecified subset of regression coefficients is included, the debiased lasso estimator of the selected submodel after a single split is asymptotically normal. Furthermore, for a set of prespecified regression coefficients, we show that a multiple splitting procedure based on the debiased lasso can address the loss of efficiency associated with sample splitting and produce asymptotically normal estimates under mild conditions. Our simulation results indicate that using the debiased lasso instead of the standard maximum likelihood estimator in the estimation stage can substantially reduce the bias and variance of the resulting estimates. We illustrate the proposed multiple splitting debiased lasso method with an analysis of the smoking data from the Mid-South Tobacco Case-Control Study.
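The split, select, and refit workflow described above can be illustrated with a minimal sketch. It rests on several simplifying assumptions that are not taken from the paper: a linear (Gaussian) model stands in for a general GLM, ordinary least squares is swapped in for the debiased lasso in the refitting stage, and estimates from multiple splits are combined by a plain average. The function names (`lasso_cd`, `split_select_refit`, `multi_split_estimate`), the penalty level `lam`, and the simulated data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def soft_threshold(z, t):
    """Scalar soft-thresholding operator used by the lasso update."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/(2n))*||y - X b||^2 + lam*||b||_1, no intercept."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()  # equals y - X @ beta while beta is zero
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]        # remove coordinate j's contribution
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]        # add the updated contribution back
    return beta


def split_select_refit(X, y, keep, lam, rng):
    """One random split: lasso selection on one half, refit on the other half.

    Assumption: the refit is plain least squares, standing in for the
    debiased lasso used in the paper.
    """
    n, p = X.shape
    perm = rng.permutation(n)
    sel_idx, fit_idx = perm[: n // 2], perm[n // 2:]
    beta_lasso = lasso_cd(X[sel_idx], y[sel_idx], lam)
    # The selected submodel always contains the prespecified coefficients.
    selected = np.union1d(np.flatnonzero(beta_lasso), np.asarray(keep, dtype=int))
    coef, *_ = np.linalg.lstsq(X[fit_idx][:, selected], y[fit_idx], rcond=None)
    full = np.zeros(p)
    full[selected] = coef
    return full


def multi_split_estimate(X, y, keep, lam, n_splits=10, seed=0):
    """Average the prespecified coefficients over repeated random splits.

    Assumption: a simple mean across splits is used as the combining rule.
    """
    rng = np.random.default_rng(seed)
    fits = [split_select_refit(X, y, keep, lam, rng) for _ in range(n_splits)]
    return np.mean(fits, axis=0)[np.asarray(keep, dtype=int)]


# Simulated data (hypothetical): 200 observations, 50 covariates, 3 true signals.
data_rng = np.random.default_rng(0)
X = data_rng.standard_normal((200, 50))
beta_true = np.zeros(50)
beta_true[:3] = [1.0, -1.0, 0.5]
y = X @ beta_true + data_rng.standard_normal(200)

# Estimate the prespecified coefficient (index 0) by multiple splitting.
est = multi_split_estimate(X, y, keep=[0], lam=0.2, n_splits=10, seed=1)
```

Selection and refitting use disjoint halves of the data, so the refit is not contaminated by the selection step; repeating the split and averaging is one simple way to recover some of the efficiency lost by fitting on only half the sample.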