This paper is concerned with inference on the conditional mean of a high-dimensional linear model when outcomes are missing at random. We propose an estimator that combines a Lasso pilot estimate of the regression function with a bias correction term based on the weighted residuals of the Lasso regression. The weights depend on estimates of the missingness probabilities (propensity scores) and solve a convex optimization program that trades off bias and variance optimally. Provided that the propensity scores can be consistently estimated, the proposed estimator is asymptotically normal and semi-parametrically efficient among all asymptotically linear estimators. The rate at which the propensity score estimates converge is essentially irrelevant, which allows us to estimate them with modern machine learning techniques. We validate the finite-sample performance of the proposed estimator through comparative simulation studies and an application to the real-world problem of inferring the stellar masses of galaxies in the Sloan Digital Sky Survey.
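The two-stage structure described above (a Lasso pilot fit plus a weighted-residual bias correction) can be illustrated with a minimal NumPy sketch. It is not the proposed estimator: the weights here are a simple stand-in, an inverse-propensity-weighted ridge-regularized precision direction, rather than the solution of the convex program mentioned in the abstract. All function names, the logistic propensity model, the tuning values, and the simulated data are assumptions made for illustration only.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = np.asarray(y, dtype=float).copy()  # equals y - X @ beta throughout
    for _ in range(n_iter):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def fit_propensity(X, r, l2=0.1, n_iter=500, lr=0.5):
    """Ridge-penalized logistic regression of the observation indicator r on X
    (an assumed propensity model; any consistent estimator could be used)."""
    n = X.shape[0]
    Z = np.column_stack([np.ones(n), X])
    theta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ theta))
        penalty = l2 * np.r_[0.0, theta[1:]]  # intercept is not penalized
        theta += lr * (Z.T @ (r - p) / n - penalty)
    p = 1.0 / (1.0 + np.exp(-Z @ theta))
    return np.clip(p, 0.05, 0.95)  # clipping keeps inverse weights bounded

def debiased_conditional_mean(X, y, r, x0, lam, ridge=0.1):
    """Lasso pilot estimate of x0' beta plus a weighted-residual correction.
    The weights are a stand-in (NOT the convex-program weights of the paper)."""
    n, d = X.shape
    obs = r == 1
    beta = lasso_cd(X[obs], y[obs], lam)           # pilot fit on observed outcomes
    pi_hat = fit_propensity(X, r)                  # estimated propensity scores
    sigma_hat = X.T @ X / n                        # covariates are always observed
    u = np.linalg.solve(sigma_hat + ridge * np.eye(d), x0)
    w = (X @ u) / pi_hat                           # stand-in debiasing weights
    resid = np.zeros(n)
    resid[obs] = y[obs] - X[obs] @ beta            # residuals exist only where observed
    return float(x0 @ beta + np.mean(w * resid))

# Simulated example (all settings are hypothetical).
rng = np.random.default_rng(0)
n, d = 400, 100
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = [2.0, -1.0, 0.5]
r = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.5 * X[:, 0])))
y = X @ beta_true + rng.normal(size=n)
y[r == 0] = np.nan                                 # outcomes missing at random given X
x0 = np.zeros(d)
x0[:3] = 1.0
lam = np.sqrt(2.0 * np.log(d) / r.sum())           # common heuristic, unit noise assumed
estimate = debiased_conditional_mean(X, y, r, x0, lam)
print("debiased estimate:", estimate, "target:", float(x0 @ beta_true))
```

The correction term only uses residuals from units with observed outcomes, and the estimated propensity scores enter solely through the weights, which mirrors the role the abstract assigns to them.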