Missing data arise in most applied settings and are ubiquitous in electronic health records (EHR). When data are missing not at random (MNAR) with respect to measured covariates, sensitivity analyses are often considered. These post-hoc solutions, however, can be unsatisfying because they are not guaranteed to yield concrete conclusions. Motivated by an EHR-based study of long-term outcomes following bariatric surgery, we consider double sampling as a means to mitigate MNAR outcome data when the statistical goals are estimation of and inference about causal effects. We describe assumptions that are sufficient to identify the joint distribution of confounders, treatment, and outcome under this design. We then derive efficient and robust estimators of the average causal treatment effect under a nonparametric model and under a model that assumes outcomes were, in fact, initially missing at random (MAR). In simulations, we compare these estimators with an approach that estimates the effect adaptively, based on evidence of violation of the MAR assumption. Finally, we show that the proposed double sampling design can be extended to handle arbitrary coarsening mechanisms, and we derive nonparametric efficient estimators of any smooth full-data functional.
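To make the double sampling idea concrete, the following is a minimal simulation sketch, not the efficient estimators derived in the paper. It assumes a simple inverse-probability-weighted (Hajek) estimator, a known treatment propensity, a follow-up subsample drawn from the initial nonrespondents with a known constant probability `pi`, and complete response among those followed up. The data-generating model, the function names (`expit`, `hajek_ate`, `simulate`), and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np


def expit(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def hajek_ate(y, a, e, w):
    """Normalized weighted difference in arm means; units with w == 0 drop out."""
    w1 = w * a / e
    w0 = w * (1 - a) / (1 - e)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)


def simulate(n=200_000, pi=0.3, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                      # measured confounder
    e = expit(0.5 * x)                          # treatment propensity (assumed known)
    a = rng.binomial(1, e)                      # treatment indicator
    y = 1.0 + 2.0 * a + x + rng.normal(size=n)  # outcome; true ATE is 2 by construction

    # MNAR: the chance that the outcome is initially observed depends on y itself.
    r = rng.binomial(1, expit(1.0 - 0.8 * y))

    # Double sampling: follow up a random subset of initial nonrespondents
    # with known probability pi (assumed to recover y for everyone selected).
    s = (1 - r) * rng.binomial(1, pi, size=n)

    # Outcomes never observed are set to 0.0; their weight below is 0.
    y_seen = np.where((r == 1) | (s == 1), y, 0.0)

    cc = hajek_ate(y_seen, a, e, w=r)           # complete-case analysis
    ds = hajek_ate(y_seen, a, e, w=r + s / pi)  # reweighted by the follow-up design
    return cc, ds


if __name__ == "__main__":
    cc, ds = simulate()
    print(f"complete-case ATE estimate:   {cc:.3f}")
    print(f"double-sampling ATE estimate: {ds:.3f}")
```

In this sketch each followed-up nonrespondent stands in for `1 / pi` nonrespondents, so the weighted sample again represents the full cohort. The complete-case estimate, by contrast, inherits the outcome-dependent selection. Because the follow-up probability is fixed by design, no model for the MNAR mechanism is needed here.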