We present an approach for modeling and imputation of nonignorable missing data. Our approach uses Bayesian data integration to combine (1) a Gaussian copula model for all study variables and missingness indicators, which allows arbitrary marginal distributions, nonignorable missingness, and other dependencies, and (2) auxiliary information in the form of marginal quantiles for some study variables. We prove that, remarkably, one only needs a small set of accurately specified quantiles to estimate the copula correlation consistently. The remaining marginal distribution functions are inferred nonparametrically and jointly with the copula parameters using an efficient MCMC algorithm. We also characterize the (additive) nonignorable missingness mechanism implied by the copula model. Simulations confirm the effectiveness of this approach for multivariate imputation with nonignorable missing data. We apply the model to analyze associations between lead exposure and end-of-grade test scores for 170,000 North Carolina students. Lead exposure has nonignorable missingness: children with higher exposure are more likely to be measured. We elicit marginal quantiles for lead exposure using statistics provided by the Centers for Disease Control and Prevention. Multiple imputation inferences under our model support stronger, more adverse associations between lead exposure and educational outcomes relative to complete-case and missing-at-random analyses.
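To illustrate why a nonignorable mechanism of this kind distorts complete-case summaries, the following is a minimal simulation sketch, not the estimation procedure described above. It assumes a two-dimensional latent Gaussian with correlation `rho`, a lognormal marginal for the study variable, and a missingness indicator obtained by thresholding the second latent coordinate at zero. The function name `simulate_copula_missingness`, the value `rho=0.6`, and the lognormal marginal are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_copula_missingness(n=50_000, rho=0.6, seed=0):
    """Draw (y, observed) from a toy Gaussian copula with a missingness indicator.

    Assumptions (illustrative only): one study variable, a lognormal
    marginal, and an indicator defined by thresholding a latent normal.
    """
    rng = np.random.default_rng(seed)
    # Latent Gaussian vector: first coordinate drives y, second drives missingness.
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=n)
    # Monotone transform of the latent normal gives a lognormal marginal
    # (equivalent to applying F^{-1}(Phi(z)) with F the lognormal CDF).
    y = np.exp(z[:, 0])
    # y is observed when the correlated latent coordinate is positive, so
    # larger values of y are more likely to be observed when rho > 0.
    observed = z[:, 1] > 0
    return y, observed


y, observed = simulate_copula_missingness()
full_median = np.median(y)             # population-level quantile (true value is 1)
cc_median = np.median(y[observed])     # complete-case quantile, biased upward
print(f"fraction observed:    {observed.mean():.3f}")
print(f"median, all units:    {full_median:.3f}")
print(f"median, observed only: {cc_median:.3f}")
```

In this toy setting the complete-case median overstates the population median, which is the kind of gap that an externally supplied marginal quantile can expose. Under the stated assumptions, that gap carries information about the dependence between the study variable and its missingness indicator.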