Missing data is a common challenge when analyzing epidemiological data, and imputation is often used to address it. Here, we investigate the scenario where a covariate used in an analysis has missing values that will be imputed. It is commonly recommended that the outcome from the analysis model be included in the imputation model for missing covariates, but it is not always clear whether this recommendation holds in every setting, or why it holds when it does. We examine deterministic imputation (i.e., single imputation with fixed values) and stochastic imputation (i.e., single or multiple imputation with random values) methods and their implications for estimating the relationship between the imputed covariate and the outcome. We demonstrate mathematically that, for stochastic imputation methods, including the outcome variable in the imputation model is not just a recommendation but a requirement for unbiased results. Moreover, we dispel common misconceptions about deterministic imputation models and demonstrate why the outcome should not be included in these models. This paper aims to bridge the gap between imputation in theory and in practice, providing mathematical derivations to explain common statistical recommendations. We offer a better understanding of the considerations involved in imputing missing covariates and emphasize when it is necessary to include the outcome variable in the imputation model.
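The contrast between deterministic and stochastic imputation can be illustrated with a small simulation. The sketch below is not taken from the paper: the linear data-generating model, the coefficient values, the 50% missing-completely-at-random mechanism, and the helper names `ols` and `estimate_beta_x` are all assumptions chosen for illustration. It uses single imputation with parameters estimated from complete cases and ignores parameter uncertainty, so it speaks only to bias in the point estimate, not to variance estimation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta_x = 1.0  # assumed true coefficient of X in the analysis model

# Assumed data-generating model: Z is fully observed, X is partly missing.
z = rng.normal(size=n)
x = 0.5 * z + rng.normal(size=n)
y = beta_x * x + 1.0 * z + rng.normal(size=n)
missing = rng.random(n) < 0.5  # X missing completely at random


def ols(design, target):
    """Least-squares fit with intercept; returns coefficients and residual SD."""
    A = np.column_stack([np.ones(len(target)), design])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    sd = np.sqrt(resid @ resid / (len(target) - A.shape[1]))
    return coef, sd


def estimate_beta_x(predictors, stochastic):
    """Impute missing X from `predictors`, then regress Y on (X, Z)."""
    obs = ~missing
    # Fit the imputation model on complete cases only.
    coef, sd = ols(predictors[obs], x[obs])
    A_mis = np.column_stack([np.ones(missing.sum()), predictors[missing]])
    x_imp = x.copy()
    x_imp[missing] = A_mis @ coef  # deterministic: predicted mean
    if stochastic:
        # stochastic: predicted mean plus a random residual draw
        x_imp[missing] += rng.normal(scale=sd, size=missing.sum())
    coef_y, _ = ols(np.column_stack([x_imp, z]), y)
    return coef_y[1]  # coefficient on the imputed covariate


without_outcome = z[:, None]
with_outcome = np.column_stack([z, y])

results = {
    "deterministic, no outcome": estimate_beta_x(without_outcome, False),
    "deterministic, with outcome": estimate_beta_x(with_outcome, False),
    "stochastic, no outcome": estimate_beta_x(without_outcome, True),
    "stochastic, with outcome": estimate_beta_x(with_outcome, True),
}
for name, est in results.items():
    print(f"{name:28s} {est:.3f}")
```

Under these assumptions, the two combinations the abstract endorses (deterministic imputation without the outcome, stochastic imputation with the outcome) should give estimates close to the true value of 1. Deterministic imputation with the outcome should overestimate the coefficient, and stochastic imputation without the outcome should attenuate it toward zero.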