Survey sampling is concerned with the estimation of finite population parameters. In practice, survey data suffer from item nonresponse, which is commonly handled through imputation, i.e., replacing missing values with predicted values. As a result, the properties of the resulting imputed estimator depend critically on the properties of the prediction method used. In turn, prediction methods themselves depend on the choice of variables and tuning parameters used to fit the imputation model. In this article, we study the problem of variable selection for linear regression imputation. Although variable selection has been widely studied across many fields, primarily for identification or prediction, its role in imputation for survey data has received comparatively little attention. We introduce the notion of an optimal imputation model defined through an oracle loss function and show that, with probability tending to one, the optimal model coincides with the true model. We also examine the consequences of using misspecified models -- either omitting relevant covariates or including irrelevant ones -- on consistency and asymptotic variance. We then develop a complete methodological framework for constructing confidence intervals after model selection. The proposed confidence intervals are shown to be asymptotically valid and optimal among all candidate models. Simulation studies indicate that the proposed methodology performs well in finite samples.
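To make the setting concrete, the following is a minimal sketch of linear regression imputation for a mean under three candidate models. It is not the article's methodology. It assumes equal survey weights (as under simple random sampling), a simulated population, and a response mechanism that depends only on an observed covariate. The names `imputed_mean`, `X`, `responded`, and `cols` are hypothetical and chosen for illustration.

```python
import numpy as np


def imputed_mean(y, X, responded, cols):
    """Regression-imputed estimate of the mean of y using covariates `cols`.

    Assumption: equal weights; a real survey estimator would use design weights.
    """
    n = len(y)
    # Design matrix: intercept plus the selected covariates.
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    # Fit by ordinary least squares on respondents only.
    beta, *_ = np.linalg.lstsq(Z[responded], y[responded], rcond=None)
    # Keep observed values; replace missing ones with predicted values.
    y_imputed = np.where(responded, y, Z @ beta)
    return y_imputed.mean()


rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 2))                       # x1 is relevant, x2 is not
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)      # assumed true model
p_respond = 1.0 / (1.0 + np.exp(-X[:, 0]))        # response depends on x1
responded = rng.random(n) < p_respond

# Benchmark: the mean we would compute with no nonresponse
# (available here only because the data are simulated).
full_mean = y.mean()

estimates = {
    "true model (x1)": imputed_mean(y, X, responded, [0]),
    "omits x1": imputed_mean(y, X, responded, []),
    "adds irrelevant x2": imputed_mean(y, X, responded, [0, 1]),
}
for name, est in estimates.items():
    print(f"{name}: estimate={est:.3f}, error={est - full_mean:+.3f}")
```

In this simulated setup, omitting the relevant covariate reduces imputation to filling in the respondent mean, which is biased because response depends on `x1`. Including the irrelevant `x2` leaves the estimator close to the benchmark, so any cost of the extra covariate would show up in the variance rather than in consistency.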