Background: Missing data is a pervasive problem in epidemiology, with complete records analysis (CRA) and multiple imputation (MI) the most common methods for dealing with incomplete data. MI is valid when incomplete variables are independent of response indicators, conditional on complete variables; however, this can be hard to assess when several variables are incomplete. Previous literature has shown that MI may be valid in subsamples of the data, even if it is not necessarily valid in the full dataset. Current guidance on how to decide whether MI is appropriate is lacking. Methods: We develop an algorithm that is sufficient to indicate when MI will estimate an exposure-outcome coefficient without bias, and show how to implement it using directed acyclic graphs (DAGs). We extend the algorithm to investigate whether MI applied to a subsample of the data, in which some variables are complete and the remaining variables are imputed, will be unbiased for the same estimand. We demonstrate the algorithm by applying it to several simple examples and a more complex real-life example. Conclusions: Multiple incomplete variables are common in practice. Assessing whether CRA and MI can each plausibly estimate an exposure-outcome association without bias is crucial when analysing and interpreting results. Our algorithm provides researchers with the tools to decide whether (and how) to use MI in practice. Further work could focus on the likely size and direction of biases, and the impact of different missing data patterns.
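As an illustration of the validity condition stated in the Background (incomplete variables independent of their response indicators, conditional on complete variables), the sketch below checks that condition by d-separation on a DAG. It is not the algorithm developed in the paper; it is a minimal, self-contained check using the standard moralised ancestral graph criterion. The function names, the node names (`X`, `Y`, `C`, `R_X`) and the two toy DAGs are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def ancestors(dag, nodes):
    """Return `nodes` together with all of their ancestors.

    `dag` maps each node to the set of its parents (assumed representation).
    """
    seen = set()
    stack = list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dag.get(node, ()))
    return seen


def d_separated(dag, xs, ys, zs):
    """True if every node in `xs` is d-separated from every node in `ys` given `zs`."""
    xs, ys, zs = set(xs), set(ys), set(zs)

    # Step 1: restrict to the ancestral graph of all variables involved.
    keep = ancestors(dag, xs | ys | zs)

    # Step 2: moralise. Link each child to its parents and "marry" co-parents.
    adj = {node: set() for node in keep}
    for child in keep:
        parents = list(dag.get(child, ()))
        for parent in parents:
            adj[parent].add(child)
            adj[child].add(parent)
        for a, b in combinations(parents, 2):
            adj[a].add(b)
            adj[b].add(a)

    # Step 3: delete the conditioning set and test whether xs can still reach ys.
    reached = set()
    stack = [x for x in xs if x not in zs]
    while stack:
        node = stack.pop()
        if node in reached:
            continue
        reached.add(node)
        stack.extend(m for m in adj[node] if m not in zs and m not in reached)
    return reached.isdisjoint(ys)


# Hypothetical example: exposure X is incomplete, outcome Y and confounder C
# are complete, and R_X is the response indicator for X.
dag_observed = {           # missingness in X depends only on complete C and Y
    "C": set(),
    "X": {"C"},
    "Y": {"C", "X"},
    "R_X": {"C", "Y"},
}
dag_self = {               # missingness in X depends on X itself
    "C": set(),
    "X": {"C"},
    "Y": {"C", "X"},
    "R_X": {"C", "X"},
}

ok_observed = d_separated(dag_observed, {"X"}, {"R_X"}, {"C", "Y"})
ok_self = d_separated(dag_self, {"X"}, {"R_X"}, {"C", "Y"})
print(ok_observed, ok_self)
```

In the first toy DAG the condition holds once `C` and `Y` are conditioned on. In the second, the direct arrow from `X` to `R_X` cannot be blocked, so the check fails. With several incomplete variables, the same function could be called with sets of incomplete variables and response indicators, although deciding which variables to treat as complete in a given subsample is exactly the question the paper's algorithm addresses and is not covered by this sketch.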