We study nonparametric estimation of univariate cumulative distribution functions (CDFs) when data are missing at random. The proposed estimators smooth the inverse probability weighted (IPW) empirical CDF with the Bernstein operator, yielding monotone, $[0,1]$-valued curves that automatically adapt to bounded supports. We analyze two versions: a pseudo estimator that uses known propensities, and a feasible estimator that uses propensities estimated nonparametrically from discrete auxiliary variables; the latter setting is far more common in practice. For both, we derive pointwise bias and variance expansions, determine the polynomial degree $m$ that is optimal with respect to the mean integrated squared error, and prove asymptotic normality. A key finding is that the variance of the feasible estimator is smaller than that of the pseudo estimator by an explicit nonnegative correction term. We also develop an efficient degree selection procedure based on least-squares cross-validation. Monte Carlo experiments show that, for small to moderate sample sizes and under certain models, the Bernstein-smoothed pseudo and feasible estimators outperform both their unsmoothed counterparts and the integrated version of the IPW kernel density estimator proposed by Dubnicka (2009). A real-data application to fasting plasma glucose from the 2017-2018 NHANES survey illustrates the method in a practical setting. All code needed to reproduce our analyses is available on GitHub.
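To make the construction concrete, the following is a minimal sketch of a Bernstein-smoothed IPW empirical CDF, not the authors' released implementation. It rests on several assumptions that the abstract does not state: the response is rescaled to $[0,1]$, the IPW empirical CDF is normalized by the sum of the weights (so that it reaches 1), and the function names `ipw_ecdf` and `bernstein_ipw_cdf` are hypothetical. Passing known propensities gives a pseudo-type estimate; passing estimated propensities gives a feasible-type estimate.

```python
from math import comb


def ipw_ecdf(t, y, observed, propensity):
    """Normalized IPW empirical CDF at t.

    Assumption: weights are normalized by their sum; the paper's exact
    normalization may differ. Missing responses (observed == 0) are skipped.
    """
    num = 0.0
    den = 0.0
    for yi, di, pi in zip(y, observed, propensity):
        if di:  # only observed responses contribute
            w = 1.0 / pi
            den += w
            if yi <= t:
                num += w
    return num / den


def bernstein_ipw_cdf(x, y, observed, propensity, m):
    """Degree-m Bernstein smoothing of the IPW empirical CDF at x in [0, 1].

    Computes sum_k F_n(k/m) * C(m, k) * x^k * (1 - x)^(m - k).
    """
    return sum(
        ipw_ecdf(k / m, y, observed, propensity)
        * comb(m, k) * x**k * (1.0 - x) ** (m - k)
        for k in range(m + 1)
    )


# Toy data (made up for illustration): None marks a missing response.
y = [0.1, 0.4, None, 0.7, 0.9]
observed = [1, 1, 0, 1, 1]
propensity = [0.8, 0.5, 0.5, 0.8, 0.5]

grid = [i / 10 for i in range(11)]
values = [bernstein_ipw_cdf(x, y, observed, propensity, m=8) for x in grid]
```

Because the coefficients $F_n(k/m)$ are nondecreasing in $k$ and lie in $[0,1]$, the resulting polynomial is itself monotone and $[0,1]$-valued on $[0,1]$, which is the shape property highlighted in the abstract. The degree `m=8` here is arbitrary; in practice it would be chosen by a data-driven rule such as the cross-validation procedure mentioned above.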