We consider parameter estimation and inference when data feature blockwise, non-monotone missingness. Our approach, rooted in semiparametric theory and inspired by prediction-powered inference, leverages off-the-shelf AI (predictive or generative) models under a missing completely at random (MCAR) mechanism. It approximates the optimal estimating equation through a novel and tractable Restricted Anova hierarchY (RAY) approximation. The resulting Inference for Blockwise Missingness (RAY) estimator, or IBM(RAY), incorporates pre-trained AI models and controls the asymptotic variance by tuning model-specific hyperparameters. We then extend IBM(RAY) to a general class of estimators and find the most efficient estimator in this class, which we call IBM(Adaptive), by solving a constrained quadratic programming problem. All IBM estimators are unbiased and, crucially, asymptotically achieve guaranteed efficiency gains over a naive complete-case estimator, regardless of the predictive accuracy of the AI models used. We demonstrate the finite-sample performance and numerical stability of our method through simulation studies and an application to surface protein abundance estimation.
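To make the efficiency claim concrete, the following is a minimal sketch of a prediction-powered mean estimator in the simplest possible setting: a single outcome block that is missing completely at random, with a pre-trained predictor available on every row. It is not the IBM(RAY) or IBM(Adaptive) estimator, which handle general blockwise, non-monotone patterns. The function names, the scalar weight `lam`, and the simulated data are all assumptions made for illustration. The weight is set to the sample value of Cov(Y, f) / Var(f), which minimizes the estimated variance, so the estimated variance never exceeds that of the complete-case mean even when the predictor is uninformative.

```python
import numpy as np


def complete_case_mean(y_complete):
    """Naive estimator: average the outcome over fully observed rows only."""
    y = np.asarray(y_complete, dtype=float)
    return y.mean(), y.var(ddof=1) / y.size


def ppi_style_mean(y_complete, f_complete, f_all):
    """Prediction-powered mean of Y under MCAR (illustrative, hypothetical helper).

    y_complete : outcomes on the n complete rows
    f_complete : predictions f(X) on those same n rows
    f_all      : predictions f(X) on all N rows (complete rows included)
    Returns (estimate, lam, estimated variance).
    """
    y = np.asarray(y_complete, dtype=float)
    fc = np.asarray(f_complete, dtype=float)
    fa = np.asarray(f_all, dtype=float)
    n, N = y.size, fa.size

    var_f = fc.var(ddof=1)
    cov_yf = np.cov(y, fc, ddof=1)[0, 1]
    # Variance-minimizing weight; fall back to 0 if the predictor is constant.
    lam = cov_yf / var_f if var_f > 0 else 0.0

    # Correct the complete-case mean by the gap between the prediction
    # averages on the complete rows and on all rows.
    estimate = y.mean() - lam * (fc.mean() - fa.mean())

    # Plug-in variance at the optimal weight:
    # Var(Y)/n - lam^2 * Var(f) * (1/n - 1/N).
    variance = y.var(ddof=1) / n - lam**2 * var_f * (1.0 / n - 1.0 / N)
    return estimate, lam, variance


# Simulated example (assumed data-generating process, true mean of Y is 0).
rng = np.random.default_rng(0)
N = 5000
x = rng.normal(size=N)
y = 2.0 * x + rng.normal(size=N)
observed = rng.random(N) < 0.1          # MCAR indicator for the outcome block

f_good = 2.0 * x                        # accurate stand-in for an AI model
f_bad = rng.normal(size=N)              # predictor unrelated to Y

cc_est, cc_var = complete_case_mean(y[observed])
good_est, good_lam, good_var = ppi_style_mean(y[observed], f_good[observed], f_good)
bad_est, bad_lam, bad_var = ppi_style_mean(y[observed], f_bad[observed], f_bad)

print("complete-case variance:", cc_var)
print("accurate predictor variance:", good_var)
print("uninformative predictor variance:", bad_var)
```

With an accurate predictor the estimated variance drops well below the complete-case value; with an uninformative predictor the weight shrinks toward zero and the estimator falls back to roughly the complete-case mean. This mirrors, in a toy setting, the role the abstract assigns to tuning model-specific hyperparameters.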