Validity of Complete Case Analysis Depends on the Target Population

Missing data is a pernicious problem in epidemiologic research. Research on the validity of complete case analysis for missing data has typically focused on estimating the average treatment effect (ATE) in the whole population. However, other targets, such as the treated (the average treatment effect in the treated, ATT) or an external target population, can be of substantive interest. In such cases, whether missing covariate data occurs within or outside the target population may affect the validity of complete case analysis. We sought to assess bias in complete case analysis when covariate data is missing outside the target (e.g., missing covariate data among the untreated when estimating the ATT). We simulated a study of the effect of a binary treatment X on a binary outcome Y in the presence of 3 confounders (C1-C3) that modified the risk difference (RD). We induced missingness in C1 only among the untreated under 4 scenarios: completely at random (similar to MCAR); randomly based on C2 and C3 (similar to MAR); randomly based on C1 (similar to MNAR); or randomly based on Y (similar to MAR). We estimated the ATE and ATT using weighting and averaged results across the replicates. We conducted a parallel simulation transporting trial results to a target population in the presence of missing covariate data in the trial. In the complete case analysis, the estimated ATE was unbiased only when C1 was MCAR among the untreated. The estimated ATT, on the other hand, was unbiased in all scenarios except when Y caused missingness. The parallel simulation of generalizing and transporting trial results showed similar bias patterns. If missing covariate data is only present outside the target population, complete case analysis is unbiased except when missingness is associated with the outcome.
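The ATT part of this design can be illustrated with a minimal sketch. It is not the authors' simulation code. The coefficients, missingness probabilities, sample size, and function names below are illustrative assumptions, a single large sample stands in for averaging over replicates, and the weights come from a saturated (stratum-based) propensity estimate rather than a fitted regression model. Only the ATT is covered.

```python
import numpy as np


def simulate(n, rng):
    """Simulate binary C1-C3, treatment X, and potential outcomes (all values assumed)."""
    c = rng.binomial(1, 0.5, size=(n, 3))
    c1, c2, c3 = c[:, 0], c[:, 1], c[:, 2]
    # Treatment depends on all three confounders.
    p_x = 1.0 / (1.0 + np.exp(-(-0.5 + 0.8 * c1 + 0.6 * c2 - 0.4 * c3)))
    x = rng.binomial(1, p_x)
    # Baseline risk and risk difference both vary with C1-C3,
    # so the confounders also modify the RD.
    risk0 = 0.10 + 0.10 * c1 + 0.05 * c2 + 0.05 * c3
    rd = 0.05 + 0.10 * c1 + 0.05 * c2 - 0.03 * c3
    y0 = rng.binomial(1, risk0)
    y1 = rng.binomial(1, risk0 + rd)
    y = np.where(x == 1, y1, y0)
    return c, x, y, y0, y1


def missing_prob(scenario, c, y):
    """Probability that C1 is missing for an untreated subject (assumed values)."""
    c1, c2, c3 = c[:, 0], c[:, 1], c[:, 2]
    if scenario == "mcar":
        return np.full(len(y), 0.4)
    if scenario == "mar_c2_c3":
        return 0.2 + 0.3 * c2 + 0.2 * c3
    if scenario == "mnar_c1":
        return 0.2 + 0.5 * c1
    if scenario == "y":
        return 0.2 + 0.5 * y
    raise ValueError(f"unknown scenario: {scenario}")


def att_complete_case(c, x, y, observed):
    """ATT risk difference among complete cases using odds weights.

    The propensity odds are estimated within each of the 8 confounder
    strata, which assumes every stratum keeps some untreated subjects.
    """
    stratum = c[:, 0] * 4 + c[:, 1] * 2 + c[:, 2]
    s, xs, ys = stratum[observed], x[observed], y[observed]
    n1 = np.bincount(s[xs == 1], minlength=8)
    n0 = np.bincount(s[xs == 0], minlength=8)
    odds = n1 / n0  # estimate of ps / (1 - ps) per stratum
    risk1 = ys[xs == 1].mean()  # treated keep weight 1
    risk0 = np.average(ys[xs == 0], weights=odds[s[xs == 0]])
    return risk1 - risk0


def run_scenarios(n=400_000, seed=0):
    """Return the bias of the complete case ATT estimate under each scenario."""
    rng = np.random.default_rng(seed)
    c, x, y, y0, y1 = simulate(n, rng)
    true_att = (y1 - y0)[x == 1].mean()
    bias = {}
    for scenario in ("mcar", "mar_c2_c3", "mnar_c1", "y"):
        p_miss = missing_prob(scenario, c, y)
        # C1 can be missing only among the untreated; the treated are always complete.
        observed = (x == 1) | (rng.random(n) >= p_miss)
        bias[scenario] = att_complete_case(c, x, y, observed) - true_att
    return bias


if __name__ == "__main__":
    for name, value in run_scenarios().items():
        print(f"{name:>10}: bias in ATT = {value:+.4f}")
```

Under these assumptions, the bias should be near zero in the first three scenarios and clearly nonzero when missingness depends on Y. This matches the reasoning in the abstract: dropping untreated subjects based on covariates alone leaves the outcome risk among the untreated unchanged within confounder strata, whereas dropping them based on Y does not.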
