Blockwise missing data occurs frequently when we integrate multisource or multimodality data, where different sources or modalities contain complementary information. In this paper, we consider a high-dimensional linear regression model with blockwise missing covariates and a partially observed response variable. Under this framework, we propose a computationally efficient estimator for the regression coefficient vector based on carefully constructed unbiased estimating equations and a blockwise imputation procedure, and obtain its rate of convergence. Furthermore, building upon an innovative projected estimating equation technique that intrinsically achieves bias correction of the initial estimator, we propose a nearly unbiased estimator for each individual regression coefficient, which is asymptotically normally distributed under mild conditions. Based on these debiased estimators, we construct asymptotically valid confidence intervals and statistical tests for each regression coefficient. Numerical studies and an application to the Alzheimer's Disease Neuroimaging Initiative data show that the proposed method performs better and benefits more from unsupervised samples than existing methods.