The exploration and analysis of massive datasets have recently attracted growing interest in the research and development communities. It has long been recognized that many datasets contain significant levels of missing numerical data. We introduce a mathematically principled stochastic optimization imputation method based on the theory of Kriging. Kriging is a powerful method for imputation, but its computational cost and potential numerical instabilities can make predictions costly and/or unreliable, potentially limiting its use on large-scale datasets. In this paper, we apply a recently developed multi-level stochastic optimization approach to the problem of imputation in massive medical records. The approach is based on computational applied mathematics techniques and is highly accurate. In particular, for the Best Linear Unbiased Predictor (BLUP) this multi-level formulation is exact, and it is also significantly faster and more numerically stable. This permits practical application of Kriging methods to data imputation problems for massive datasets. We test this approach on data from the National Inpatient Sample (NIS), Healthcare Cost and Utilization Project (HCUP), Agency for Healthcare Research and Quality. Numerical results show that the multi-level method significantly outperforms current approaches and is numerically robust. In particular, it is more accurate than the methods recommended in the recent HCUP report on the important problem of missing data, a problem that could lead to sub-optimal and poorly based funding policy decisions. In comparative benchmark tests, the multi-level stochastic method is significantly superior to the methods recommended in the report, including Predictive Mean Matching (PMM) and Predicted Posterior Distribution (PPD), with up to 75% reductions in error.