We consider the problem of estimating the mean of a random variable Y subject to non-ignorable missingness, i.e., where the missingness mechanism depends on Y. We connect the auxiliary proxy variable framework for non-ignorable missingness (West and Little, 2013) to the label shift setting (Saerens et al., 2002). Exploiting this connection, we construct an estimator for non-ignorable missing data that uses high-dimensional covariates (or proxies) without the need for a generative model. In synthetic and semi-synthetic experiments, we study the behavior of the proposed estimator, comparing it to commonly used ignorable estimators in both well-specified and misspecified settings. Additionally, we develop a score to assess how consistent the data are with the label shift assumption. We apply our approach to estimate disease prevalence from a large health survey, comparing ignorable and non-ignorable approaches. We show that failing to account for non-ignorable missingness can have profound consequences for conclusions drawn from non-representative samples.
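To make the connection to label shift concrete, the following is a minimal sketch of the EM prior-adjustment procedure of Saerens et al. (2002), applied to a binary Y. It is an illustration and not the estimator proposed in this work. It assumes that the distribution of the covariates given Y is the same for respondents and non-respondents, and that a classifier fitted on respondents supplies the posteriors for the rows where Y is missing. The function names `em_label_shift_prior` and `overall_mean_binary` are hypothetical and chosen for this example.

```python
import numpy as np


def em_label_shift_prior(posteriors, source_prior, n_iter=200, tol=1e-8):
    """Estimate the class prior among unlabeled rows under label shift.

    posteriors:   (n, K) array of p(y | x) from a classifier fitted on the
                  labeled (observed-Y) rows, evaluated on the unlabeled rows.
    source_prior: (K,) class prior among the labeled rows (all entries > 0).
    """
    posteriors = np.asarray(posteriors, dtype=float)
    source_prior = np.asarray(source_prior, dtype=float)
    prior = source_prior.copy()
    for _ in range(n_iter):
        # E-step: reweight each posterior by the ratio of current to source prior.
        weighted = posteriors * (prior / source_prior)
        weighted /= weighted.sum(axis=1, keepdims=True)
        # M-step: the new prior is the average adjusted posterior.
        new_prior = weighted.mean(axis=0)
        converged = np.max(np.abs(new_prior - prior)) < tol
        prior = new_prior
        if converged:
            break
    return prior


def overall_mean_binary(y_observed, posteriors_missing):
    """Combine the observed mean of a binary Y with the estimated mean
    among rows where Y is missing (hypothetical helper).

    Assumes both classes appear among the observed rows.
    """
    y_observed = np.asarray(y_observed, dtype=float)
    posteriors_missing = np.asarray(posteriors_missing, dtype=float)
    n_obs, n_mis = len(y_observed), len(posteriors_missing)
    p1 = y_observed.mean()
    source_prior = np.array([1.0 - p1, p1])
    target_prior = em_label_shift_prior(posteriors_missing, source_prior)
    # Weighted average of the respondent mean and the estimated non-respondent mean.
    return (n_obs * p1 + n_mis * target_prior[1]) / (n_obs + n_mis)
```

An ignorable analysis would instead average the unadjusted posteriors over the missing rows. The two answers coincide only when the prevalence of Y is the same among respondents and non-respondents.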